TODO file for GNU ptx - last revised 05 November 1993.
Copyright (C) 1992, 1993 Free Software Foundation, Inc.
Francois Pinard <pinard@iro.umontreal.ca>, 1992.

The following are more or less in decreasing order of priority.

* Use rx instead of regex.

* Correct the infinite loop using -S '$' or -S '^'.
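
A minimal sketch of how such a loop can arise and be avoided, in Python
rather than the C of ptx, and assuming the cause is a regexp matching
the empty string: `$' and `^' consume no input, so a scanner resuming
at the end of the previous match never advances. The name
`find_boundaries' and the one-character step are illustrative
assumptions, not the actual ptx code.

```python
import re


def find_boundaries(text, pattern):
    """Return the start offset of every match of pattern in text."""
    regex = re.compile(pattern, re.MULTILINE)
    starts = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        starts.append(match.start())
        if match.end() == match.start():
            # Empty match: step one character past it, otherwise the
            # next search would find the same empty match forever.
            pos = match.end() + 1
        else:
            pos = match.end()
    return starts
```

With this guard, an end-of-line or start-of-line regexp yields one
boundary per line and the scan terminates.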

* Use mmap for swallowing files (possibly wrong if the mapped memory
is edited).
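
A small sketch of the idea using Python's `mmap' module rather than the
C call. It assumes a private copy-on-write mapping (ACCESS_COPY) is
acceptable, so that edits made to the buffer in memory never reach the
file on disk; `swallow_file' is a hypothetical name.

```python
import mmap


def swallow_file(path):
    """Map a whole file into memory as a private, writable buffer."""
    with open(path, "rb") as handle:
        # Length 0 maps the entire file.  ACCESS_COPY makes writes
        # affect the mapped memory only, never the underlying file.
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_COPY)
```

Mapping an empty file raises ValueError, so a caller would have to
treat that case separately.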

* Understand and mimic the `-t' option, if I can.

* Sort keywords intelligently for Latin-1 code. See how to interface
this character set with various output formats. Also, introduce
options to inverse-sort and possibly to reverse-sort.
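
One possible reading of "intelligent" sorting, sketched in Python: the
primary key ignores case and accents, and the raw word only breaks
ties. The names `latin1_sort_key' and `sort_keywords' are hypothetical,
and the use of Unicode decomposition is an assumption of this sketch,
not something ptx does. Only reverse sorting is shown.

```python
import unicodedata


def latin1_sort_key(word):
    """Key that compares base letters first, then the exact spelling."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks so accented letters sort with their base.
    base = "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))
    return (base.casefold(), word)


def sort_keywords(words, reverse=False):
    """Return the words sorted by latin1_sort_key."""
    return sorted(words, key=latin1_sort_key, reverse=reverse)
```

Latin-1 letters without a decomposition (such as the ae ligature or
thorn) keep their code point order here, so a real table-driven
collation would still be needed for them.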

* Improve speed for Ignore and Only tables. Consider hashing instead
of sorting. Consider playing with obstacks to digest them.
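
A rough sketch of the hashing alternative, in Python. It assumes the
tables hold one word per line and that a word is kept only if it is in
the Only table (when one is given) and not in the Ignore table. All
function names are hypothetical; the binary search version is included
only for comparison with a sorted table.

```python
import bisect


def load_table(lines):
    """Build a hashed table from lines holding one word each."""
    return frozenset(line.strip() for line in lines if line.strip())


def keep_word(word, only=None, ignore=frozenset()):
    """Filter a word with average constant-time hash lookups."""
    if only is not None and word not in only:
        return False
    return word not in ignore


def in_sorted_table(sorted_words, word):
    """Sorted-table lookup by binary search, O(log n) per word."""
    index = bisect.bisect_left(sorted_words, word)
    return index < len(sorted_words) and sorted_words[index] == word
```

Hashing also removes the need to sort the tables when they are read.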

* Provide better handling of format effectors obtained from input, and
also attempt white space compression on output which would still
maximize full output width usage.

* See how TeX mode could be made more useful, and if a texinfo mode
would mean something to someone.

* Provide multiple language support

Most of the boosting work should go along the lines of fast recognition
of multiple and complex boundaries, which define various `languages'.
Each such language has its own rules for words, sentences, paragraphs,
and reporting requests. This is less difficult than I first thought:

. Recognize language modifiers with each option. At least -b, -i, -o,
-W, -S, and also new language switcher options, will have such
modifiers. Modifiers on language switchers will allow or disallow
language transitions.

. Complete the transformation of underlying variables into arrays in
the code.

. Implement a heap of positions in the input file. There is one entry
in the heap for each compiled regexp; it is initialized by a re_search
after each regexp compile. Regexps reschedule themselves in the heap
when their position passes while scanning input. In this way, looking
simultaneously for a lot of regexps should not be too inefficient,
once the scanning starts. If this works ok, maybe consider accepting
regexps in Only and Ignore tables.
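
The heap described above can be sketched as follows, using Python's
`heapq' and `re' modules in place of re_search. This is only an
illustration under assumptions: the function name `scan', the tuple
layout, and the choice to resume each regexp just past its own
previous match are all hypothetical.

```python
import heapq
import re


def scan(text, patterns):
    """Yield (start, end, pattern_index) for all matches, by position."""
    regexes = [re.compile(pattern) for pattern in patterns]
    heap = []
    # One heap entry per compiled regexp, set up by an initial search.
    for index, regex in enumerate(regexes):
        match = regex.search(text, 0)
        if match:
            heap.append((match.start(), index, match.end()))
    heapq.heapify(heap)
    while heap:
        start, index, end = heapq.heappop(heap)
        yield start, end, index
        # Reschedule this regexp past its match; step over empty
        # matches so the scan always makes progress.
        resume = end if end > start else end + 1
        if resume <= len(text):
            match = regexes[index].search(text, resume)
            if match:
                heapq.heappush(heap, (match.start(), index, match.end()))
```

Each regexp is searched again only when its own entry is consumed, so
the work per reported match is one search plus a heap operation that is
logarithmic in the number of regexps.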

. Merge boundary processing options with language processing, really
integrating -S processing as a special case. Maybe implement several
levels of boundaries. See how to implement a stack of languages, for
handling quotations. See if more sophisticated references could be
handled as another special case of a language.

* Tackle other aspects, in a longer term view

. Add options for statistics, frequency lists, referencing, and all
other prescreening tools and subsidiary tasks of concordance
production.

. Develop an interactive mode. Even better, construct a GNU Emacs
interface. I'm looking at Gene Myers' <gene@cs.arizona.edu> suffix
arrays as a possible implementation along those lines.
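
For orientation, a naive suffix array sketch in Python. It sorts the
suffixes directly, which is far slower than the efficient construction
methods, and both function names are hypothetical. It only shows why
the structure suits interactive lookups: once the array is built, every
occurrence of a word is found by binary search.

```python
def build_suffix_array(text):
    """Return suffix start offsets, sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, word):
    """Return the sorted offsets at which word occurs in text."""
    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + len(word)]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= word
        mid = (lo + hi) // 2
        if prefix(mid) < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > word
        mid = (lo + hi) // 2
        if prefix(mid) <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])
```

The array costs one integer per input character, which matters for the
memory questions raised further down.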

. Implement hooks so word classification and tagging could be merged
in. See how to effectively hook in lemmatisation or other
morphological features. It is far from clear yet how to interface
this correctly, so some experimentation is mandatory.

. Profile and speed up the whole thing.

. Make it work on small address space machines. Consider three levels
of hugeness for files, and three corresponding algorithms to make
optimal use of memory. The first case is when all the input files and
all the word references fit in memory: this is the case currently
implemented. The second case is when the files cannot fit all together
in memory, but the word references do. The third case is when even
the word references cannot fit in memory.

. There are also subsidiary developments for in-core incremental sort
routines as well as for external sort packages. The need for more
flexible sort packages comes partly from the fact that linguists use
kinds of keys which compare in unusual and more sophisticated ways.
GNU `sort' and `ptx' could evolve together.
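
A toy sketch of the external sort shape with a pluggable key, in
Python. It is an assumption-laden illustration: the runs stay in memory
here, whereas a real external sort would write each sorted run to a
temporary file and merge the files; `sort_in_runs' is a hypothetical
name.

```python
import heapq


def sort_in_runs(items, run_size, key=None):
    """Sort a list by sorting fixed-size runs, then merging them."""
    runs = [sorted(items[i:i + run_size], key=key)
            for i in range(0, len(items), run_size)]
    # heapq.merge reads the sorted runs lazily and keeps equal keys
    # in run order, so the overall result is stable.
    return list(heapq.merge(*runs, key=key))
```

The same key function is used for sorting the runs and for merging
them, which is where an unusual comparison rule would plug in.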


Local Variables:
mode: outline
outline-regexp: " *[-+*.] \\|"
eval: (hide-body)
End: