95 lines
3.9 KiB
Plaintext
95 lines
3.9 KiB
Plaintext
TODO file for GNU ptx - last revised 05 November 1993.
|
||
Copyright (C) 1992, 1993 Free Software Foundation, Inc.
|
||
Francois Pinard <pinard@iro.umontreal.ca>, 1992.
|
||
|
||
The following are more or less in decreasing order of priority.
|
||
|
||
* Use rx instead of regex.
|
||
|
||
* Correct the infinite loop using -S '$' or -S '^'.
|
||
|
||
* Use mmap for swallowing files (maybe wrong when memory edited).
|
||
|
||
* Understand and mimic `-t' option, if I can.
|
||
|
||
* Sort keywords intelligently for Latin-1 code. See how to interface
|
||
this character set with various output formats. Also, introduce
|
||
options to inverse-sort and possibly to reverse-sort.
|
||
|
||
* Improve speed for Ignore and Only tables. Consider hashing instead
|
||
of sorting. Consider playing with obstacks to digest them.
|
||
|
||
* Provide better handling of format effectors obtained from input, and
|
||
also attempt white space compression on output which would still
|
||
maximize full output width usage.
|
||
|
||
* See how TeX mode could be made more useful, and if a texinfo mode
|
||
would mean something to someone.
|
||
|
||
* Provide multiple language support
|
||
|
||
Most of the boosting work should go along the line of fast recognition
|
||
of multiple and complex boundaries, which define various `languages'.
|
||
Each such language has its own rules for words, sentences, paragraphs,
|
||
and reporting requests. This is less difficult than I first thought:
|
||
|
||
. Recognize language modifiers with each option. At least -b, -i, -o,
|
||
-W, -S, and also new language switcher options, will have such
|
||
modifiers. Modifiers on language switchers will allow or disallow
|
||
language transitions.
|
||
|
||
. Complete the transformation of underlying variables into arrays in
|
||
the code.
|
||
|
||
. Implement a heap of positions in the input file. There is one entry
|
||
in the heap for each compiled regexp; it is initialized by a re_search
|
||
after each regexp compile. Regexps reschedule themselves in the heap
|
||
when their position passes while scanning input. In this way, looking
|
||
simultaneously for a lot of regexps should not be too inefficient,
|
||
once the scanning starts. If this works ok, maybe consider accepting
|
||
regexps in Only and Ignore tables.
|
||
|
||
. Merge with language processing boundary processing options, really
|
||
integrating -S processing as a special case. Maybe, implement several
|
||
level of boundaries. See how to implement a stack of languages, for
|
||
handling quotations. See if more sophisticated references could be
|
||
handled as another special case of a language.
|
||
|
||
* Tackle other aspects, in a more long term view
|
||
|
||
. Add options for statistics, frequency lists, referencing, and all
|
||
other prescreening tools and subsidiary tasks of concordance
|
||
production.
|
||
|
||
. Develop an interactive mode. Even better, construct a GNU emacs
|
||
interface. I'm looking at Gene Myers <gene@cs.arizona.edu> suffix
|
||
arrays as a possible implementation along those ideas.
|
||
|
||
. Implement hooks so word classification and tagging should be merged
|
||
in. See how to effectively hook in lemmatisation or other
|
||
morphological features. It is far from being clear by now how to
|
||
interface this correctly, so some experimentation is mandatory.
|
||
|
||
. Profile and speed up the whole thing.
|
||
|
||
. Make it work on small address space machines. Consider three levels
|
||
of hugeness for files, and three corresponding algorithms to make
|
||
optimal use of memory. The first case is when all the input files and
|
||
all the word references fit in memory: this is the case currently
|
||
implemented. The second case is when the files cannot fit all together
|
||
in memory, but the word references do. The third case is when even
|
||
the word references cannot fit in memory.
|
||
|
||
. There also are subsidiary developments for in-core incremental sort
|
||
routines as well as for external sort packages. The need for more
|
||
flexible sort packages comes partly from the fact that linguists use
|
||
kinds of keys which compare in unusual and more sophisticated ways.
|
||
GNU `sort' and `ptx' could evolve together.
|
||
|
||
|
||
Local Variables:
|
||
mode: outline
|
||
outline-regexp: " *[-+*.] \\|"
|
||
eval: (hide-body)
|
||
End:
|