Use -lgnuregex properly

Install infopages
1995-01-12 01:30:34 +00:00 · 1995-01-12 01:30:34 +00:00 · 72faa78f5b
commit 72faa78f5b
parent 1df8e837d0
4 changed files with 559 additions and 2 deletions
--- a/gnu/usr.bin/ptx/Makefile
+++ b/gnu/usr.bin/ptx/Makefile
@ -2,7 +2,7 @@ PROG=	ptx
 SRCS=	argmatch.c diacrit.c error.c getopt.c getopt1.c ptx.c xmalloc.c

 LDADD+=	-lgnuregex
-DPADD+= ${GNUREGEX}
+DPADD+= ${LIBGNUREGEX}
 CFLAGS+= -DHAVE_CONFIG_H -DDEFAULT_IGNORE_FILE=\"/usr/share/dict/eign\"

 NOMAN=
--- a/gnu/usr.bin/ptx/doc/Makefile
+++ b/gnu/usr.bin/ptx/doc/Makefile
@ -0,0 +1,3 @@
+INFO = ptx
+
+.include <bsd.info.mk>
--- a/gnu/usr.bin/ptx/doc/ptx.texinfo
+++ b/gnu/usr.bin/ptx/doc/ptx.texinfo
@ -0,0 +1,554 @@
+\input texinfo @c -*-texinfo-*-
+@c %**start of header
+@setfilename ptx.info
+@settitle GNU @code{ptx} reference manual
+@finalout
+@c %**end of header
+
+@ifinfo
+This file documents the @code{ptx} command, which has the purpose of
+generated permuted indices for group of files.
+
+Copyright (C) 1990, 1991, 1993 by the Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+@ignore
+Permission is granted to process this file through TeX and print the
+results, provided the printed document carries copying permission
+notice identical to this one except for the removal of this paragraph
+(this paragraph not being relevant to the printed manual).
+
+@end ignore
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided that the entire
+resulting derived work is distributed under the terms of a permission
+notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions,
+except that this permission notice may be stated in a translation approved
+by the Foundation.
+@end ifinfo
+
+@titlepage
+@title ptx
+@subtitle The GNU permuted indexer
+@subtitle Edition 0.3, for ptx version 0.3
+@subtitle November 1993
+@author by Francois Pinard
+
+@page
+@vskip 0pt plus 1filll
+Copyright @copyright{} 1990, 1991, 1993 Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided that the entire
+resulting derived work is distributed under the terms of a permission
+notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions,
+except that this permission notice may be stated in a translation approved
+by the Foundation.
+@end titlepage
+
+@node Top, Invoking ptx, (dir), (dir)
+@chapter Introduction
+
+This is the 0.3 beta release of @code{ptx}, the GNU version of a
+permuted index generator.  This software has the main goal of providing
+a replacement for the traditional @code{ptx} as found on System V
+machines, able to handle small files quickly, while providing a platform
+for more development.
+
+This version reimplements and extends traditional @code{ptx}.  Among
+other things, it can produce a readable @dfn{KWIC} (keywords in their
+context) without the need of @code{nroff}, there is also an option to
+produce @TeX{} compatible output.  This version does not handle huge
+input files, that is, those files which do not fit in memory all at
+once.
+
+@emph{Please note} that an overall renaming of all options is
+foreseeable.  In fact, GNU ptx specifications are not frozen yet.
+
+@menu
+* Invoking ptx::                How to use this program
+* Compatibility::               The GNU extensions to @code{ptx}
+
+ --- The Detailed Node Listing ---
+
+How to use this program
+
+* General options::             Options which affect general program behaviour.
+* Charset selection::           Underlying character set considerations.
+* Input processing::            Input fields, contexts, and keyword selection.
+* Output formatting::           Types of output format, and sizing the fields.
+@end menu
+
+@node Invoking ptx, Compatibility, Top, Top
+@chapter How to use this program
+
+This tool reads a text file and essentially produces a permuted index, with
+each keyword in its context.  The calling sketch is one of:
+
+@example
+ptx [@var{option} @dots{}] [@var{file} @dots{}]
+@end example
+
+or:
+
+@example
+ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
+@end example
+
+The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
+all GNU extensions and revert to traditional mode, thus introducing some
+limitations, and changes several of the program's default option values.
+When @samp{-G} is not specified, GNU extensions are always enabled.  GNU
+extensions to @code{ptx} are documented wherever appropriate in this
+document.  See @xref{Compatibility} for an explicit list of them.
+
+Individual options are explained later in this document.
+
+When GNU extensions are enabled, there may be zero, one or several
+@var{file} after the options.  If there is no @var{file}, the program
+reads the standard input.  If there is one or several @var{file}, they
+give the name of input files which are all read in turn, as if all the
+input files were concatenated.  However, there is a full contextual
+break between each file and, when automatic referencing is requested,
+file names and line numbers refer to individual text input files.  In
+all cases, the program produces the permuted index onto the standard
+output.
+  
+When GNU extensions are @emph{not} enabled, that is, when the program
+operates in traditional mode, there may be zero, one or two parameters
+besides the options.  If there is no parameters, the program reads the
+standard input and produces the permuted index onto the standard output.
+If there is only one parameter, it names the text @var{input} to be read
+instead of the standard input.  If two parameters are given, they give
+respectively the name of the @var{input} file to read and the name of
+the @var{output} file to produce.  @emph{Be very careful} to note that,
+in this case, the contents of file given by the second parameter is
+destroyed.  This behaviour is dictated only by System V @code{ptx}
+compatibility, because GNU Standards discourage output parameters not
+introduced by an option.
+
+Note that for @emph{any} file named as the value of an option or as an
+input text file, a single dash @kbd{-} may be used, in which case
+standard input is assumed.  However, it would not make sense to use this
+convention more than once per program invocation.
+
+@menu
+* General options::             Options which affect general program behaviour.
+* Charset selection::           Underlying character set considerations.
+* Input processing::            Input fields, contexts, and keyword selection.
+* Output formatting::           Types of output format, and sizing the fields.
+@end menu
+
+@node General options, Charset selection, Invoking ptx, Invoking ptx
+@section General options
+
+@table @code
+
+@item -C
+@itemx --copyright
+Prints a short note about the Copyright and copying conditions, then
+exit without further processing.
+
+@item -G
+@itemx --traditional
+As already explained, this option disables all GNU extensions to
+@code{ptx} and switch to traditional mode.
+
+@item --help
+Prints a short help on standard output, then exit without further
+processing.
+
+@item --version
+Prints the program verison on standard output, then exit without further
+processing.
+
+@end table
+
+@node Charset selection, Input processing, General options, Invoking ptx
+@section Charset selection
+
+As it is setup now, the program assumes that the input file is coded
+using 8-bit ISO 8859-1 code, also known as Latin-1 character set,
+@emph{unless} if it is compiled for MS-DOS, in which case it uses the
+character set of the IBM-PC.  (GNU @code{ptx} is not known to work on
+smaller MS-DOS machines anymore.)  Compared to 7-bit ASCII, the set of
+characters which are letters is then different, this fact alters the
+behaviour of regular expression matching.  Thus, the default regular
+expression for a keyword allows foreign or diacriticized letters.
+Keyword sorting, however, is still crude; it obeys the underlying
+character set ordering quite blindly.
+
+@table @code
+
+@item -f
+@itemx --ignore-case
+Fold lower case letters to upper case for sorting.
+
+@end table
+
+@node Input processing, Output formatting, Charset selection, Invoking ptx
+@section Word selection
+
+@table @code
+
+@item -b @var{file}
+@item --break-file=@var{file}
+
+This option is an alternative way to option @code{-W} for describing
+which characters make up words.  This option introduces the name of a
+file which contains a list of characters which can@emph{not} be part of
+one word, this file is called the @dfn{Break file}.  Any character which
+is not part of the Break file is a word constituent.  If both options
+@code{-b} and @code{-W} are specified, then @code{-W} has precedence and
+@code{-b} is ignored.
+
+When GNU extensions are enabled, the only way to avoid newline as a
+break character is to write all the break characters in the file with no
+newline at all, not even at the end of the file.  When GNU extensions
+are disabled, spaces, tabs and newlines are always considered as break
+characters even if not included in the Break file.
+
+@item -i @var{file}
+@itemx --ignore-file=@var{file}
+
+The file associated with this option contains a list of words which will
+never be taken as keywords in concordance output.  It is called the
+@dfn{Ignore file}.  The file contains exactly one word in each line; the
+end of line separation of words is not subject to the value of the
+@code{-S} option.
+
+There is a default Ignore file used by @code{ptx} when this option is
+not specified, usually found in @file{/usr/local/lib/eign} if this has
+not been changed at installation time.  If you want to deactivate the
+default Ignore file, specify @code{/dev/null} instead.
+
+@item -o @var{file}
+@itemx --only-file=@var{file}
+
+The file associated with this option contains a list of words which will
+be retained in concordance output, any word not mentioned in this file
+is ignored.  The file is called the @dfn{Only file}.  The file contains
+exactly one word in each line; the end of line separation of words is
+not subject to the value of the @code{-S} option.
+
+There is no default for the Only file.  In the case there are both an
+Only file and an Ignore file, a word will be subject to be a keyword
+only if it is given in the Only file and not given in the Ignore file.
+
+@item -r
+@itemx --references
+
+On each input line, the leading sequence of non white characters will be
+taken to be a reference that has the purpose of identifying this input
+line on the produced permuted index.  See @xref{Output formatting} for
+more information about reference production.  Using this option change
+the default value for option @code{-S}.
+
+Using this option, the program does not try very hard to remove
+references from contexts in output, but it succeeds in doing so
+@emph{when} the context ends exactly at the newline.  If option
+@code{-r} is used with @code{-S} default value, or when GNU extensions
+are disabled, this condition is always met and references are completely
+excluded from the output contexts.
+
+@item -S @var{regexp}
+@itemx --sentence-regexp=@var{regexp}
+
+This option selects which regular expression will describe the end of a
+line or the end of a sentence.  In fact, there is other distinction
+between end of lines or end of sentences than the effect of this regular
+expression, and input line boundaries have no special significance
+outside this option.  By default, when GNU extensions are enabled and if
+@code{-r} option is not used, end of sentences are used.  In this
+case, the precise @var{regex} is imported from GNU emacs:
+
+@example
+[.?!][]\"')@}]*\\($\\|\t\\|  \\)[ \t\n]*
+@end example
+
+Whenever GNU extensions are disabled or if @code{-r} option is used, end
+of lines are used; in this case, the default @var{regexp} is just:
+
+@example
+\n
+@end example
+
+Using an empty REGEXP is equivalent to completely disabling end of line or end
+of sentence recognition.  In this case, the whole file is considered to
+be a single big line or sentence.  The user might want to disallow all
+truncation flag generation as well, through option @code{-F ""}.
+@xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
+Manual}.
+
+When the keywords happen to be near the beginning of the input line or
+sentence, this often creates an unused area at the beginning of the
+output context line; when the keywords happen to be near the end of the
+input line or sentence, this often creates an unused area at the end of
+the output context line.  The program tries to fill those unused areas
+by wrapping around context in them; the tail of the input line or
+sentence is used to fill the unused area on the left of the output line;
+the head of the input line or sentence is used to fill the unused area
+on the right of the output line.
+
+As a matter of convenience to the user, many usual backslashed escape
+sequences, as found in the C language, are recognized and converted to
+the corresponding characters by @code{ptx} itself.
+
+@item -W @var{regexp}
+@itemx --word-regexp=@var{regexp}
+
+This option selects which regular expression will describe each keyword.
+By default, if GNU extensions are enabled, a word is a sequence of
+letters; the @var{regexp} used is @code{\w+}.  When GNU extensions are
+disabled, a word is by default anything which ends with a space, a tab
+or a newline; the @var{regexp} used is @code{[^ \t\n]+}.
+
+An empty REGEXP is equivalent to not using this option, letting the
+default dive in.  @xref{Regexps, , Syntax of Regular Expressions, emacs,
+The GNU Emacs Manual}.
+
+As a matter of convenience to the user, many usual backslashed escape
+sequences, as found in the C language, are recognized and converted to
+the corresponding characters by @code{ptx} itself.
+
+@end table
+
+@node Output formatting,  , Input processing, Invoking ptx
+@section Output formatting
+
+Output format is mainly controlled by @code{-O} and @code{-T} options,
+described in the table below.  When neither @code{-O} nor @code{-T} is
+selected, and if GNU extensions are enabled, the program choose an
+output format suited for a dumb terminal.  Each keyword occurrence is
+output to the center of one line, surrounded by its left and right
+contexts.  Each field is properly justified, so the concordance output
+could readily be observed.  As a special feature, if automatic
+references are selected by option @code{-A} and are output before the
+left context, that is, if option @code{-R} is @emph{not} selected, then
+a colon is added after the reference; this nicely interfaces with GNU
+Emacs @code{next-error} processing.  In this default output format, each
+white space character, like newline and tab, is merely changed to
+exactly one space, with no special attempt to compress consecutive
+spaces.  This might change in the future.  Except for those white space
+characters, every other character of the underlying set of 256
+characters is transmitted verbatim.
+
+Output format is further controlled by the following options.
+
+@table @code
+
+@item -g @var{number}
+@itemx --gap-size=@var{number}
+
+Select the size of the minimum white gap between the fields on the output
+line.
+
+@item -w @var{number}
+@itemx --width=@var{number}
+
+Select the output maximum width of each final line.  If references are
+used, they are included or excluded from the output maximum width
+depending on the value of option @code{-R}.  If this option is not
+selected, that is, when references are output before the left context,
+the output maximum width takes into account the maximum length of all
+references.  If this options is selected, that is, when references are
+output after the right context, the output maximum width does not take
+into account the space taken by references, nor the gap that precedes
+them.
+
+@item -A
+@itemx --auto-reference
+
+Select automatic references.  Each input line will have an automatic
+reference made up of the file name and the line ordinal, with a single
+colon between them.  However, the file name will be empty when standard
+input is being read.  If both @code{-A} and @code{-r} are selected, then
+the input reference is still read and skipped, but the automatic
+reference is used at output time, overriding the input reference.
+
+@item -R
+@itemx --right-side-refs
+
+In default output format, when option @code{-R} is not used, any
+reference produced by the effect of options @code{-r} or @code{-A} are
+given to the far right of output lines, after the right context.  In
+default output format, when option @code{-R} is specified, references
+are rather given to the beginning of each output line, before the left
+context.  For any other output format, option @code{-R} is almost
+ignored, except for the fact that the width of references is @emph{not}
+taken into account in total output width given by @code{-w} whenever
+@code{-R} is selected.
+
+This option is automatically selected whenever GNU extensions are
+disabled.
+
+@item -F @var{string}
+@itemx --flac-truncation=@var{string}
+
+This option will request that any truncation in the output be reported
+using the string @var{string}.  Most output fields theoretically extend
+towards the beginning or the end of the current line, or current
+sentence, as selected with option @code{-S}.  But there is a maximum
+allowed output line width, changeable through option @code{-w}, which is
+further divided into space for various output fields.  When a field has
+to be truncated because cannot extend until the beginning or the end of
+the current line to fit in the, then a truncation occurs.  By default,
+the string used is a single slash, as in @code{-F /}.
+
+@var{string} may have more than one character, as in @code{-F ...}.
+Also, in the particular case @var{string} is empty (@code{-F ""}),
+truncation flagging is disabled, and no truncation marks are appended in
+this case.
+
+As a matter of convenience to the user, many usual backslashed escape
+sequences, as found in the C language, are recognized and converted to
+the corresponding characters by @code{ptx} itself.
+
+@item -M @var{string}
+@itemx --macro-name=@var{string}
+
+Select another @var{string} to be used instead of @samp{xx}, while
+generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
+
+@item -O
+@itemx --format=roff
+
+Choose an output format suitable for @code{nroff} or @code{troff}
+processing.  Each output line will look like:
+
+@example
+.xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
+@end example
+
+so it will be possible to write an @samp{.xx} roff macro to take care of
+the output typesetting.  This is the default output format when GNU
+extensions are disabled.  Option @samp{-M} might be used to change
+@samp{xx} to another macro name.
+
+In this output format, each non-graphical character, like newline and
+tab, is merely changed to exactly one space, with no special attempt to
+compress consecutive spaces.  Each quote character: @kbd{"} is doubled
+so it will be correctly processed by @code{nroff} or @code{troff}.
+
+@item -T
+@itemx --format=tex
+
+Choose an output format suitable for @TeX{} processing.  Each output
+line will look like:
+
+@example
+\xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
+@end example
+
+@noindent
+so it will be possible to write write a @code{\xx} definition to take
+care of the output typesetting.  Note that when references are not being
+produced, that is, neither option @code{-A} nor option @code{-r} is
+selected, the last parameter of each @code{\xx} call is inhibited.
+Option @samp{-M} might be used to change @samp{xx} to another macro
+name.
+
+In this output format, some special characters, like @kbd{$}, @kbd{%},
+@kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
+backslash.  Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
+backslash, but also enclosed in a pair of dollar signs to force
+mathematical mode.  The backslash itself produces the sequence
+@code{\backslash@{@}}.  Circumflex and tilde diacritics produce the
+sequence @code{^\@{ @}} and @code{~\@{ @}} respectively.  Other
+diacriticized characters of the underlying character set produce an
+appropriate @TeX{} sequence as far as possible.  The other non-graphical
+characters, like newline and tab, and all others characters which are
+not part of ASCII, are merely changed to exactly one space, with no
+special attempt to compress consecutive spaces.  Let me know how to
+improve this special character processing for @TeX{}.
+
+@end table
+
+@node Compatibility,  , Invoking ptx, Top
+@chapter The GNU extensions to @code{ptx}
+
+This version of @code{ptx} contains a few features which do not exist in
+System V @code{ptx}.  These extra features are suppressed by using the
+@samp{-G} command line option, unless overridden by other command line
+options.  Some GNU extensions cannot be recovered by overriding, so the
+simple rule is to avoid @samp{-G} if you care about GNU extensions.
+Here are the differences between this program and System V @code{ptx}.
+
+@itemize @bullet
+
+@item
+This program can read many input files at once, it always writes the
+resulting concordance on standard output.  On the other end, System V
+@code{ptx} reads only one file and produce the result on standard output
+or, if a second @var{file} parameter is given on the command, to that
+@var{file}.
+
+Having output parameters not introduced by options is a quite dangerous
+practice which GNU avoids as far as possible.  So, for using @code{ptx}
+portably between GNU and System V, you should pay attention to always
+use it with a single input file, and always expect the result on
+standard output.  You might also want to automatically configure in a
+@samp{-G} option to @code{ptx} calls in products using @code{ptx}, if
+the configurator finds that the installed @code{ptx} accepts @samp{-G}.
+
+@item
+The only options available in System V @code{ptx} are options @samp{-b},
+@samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
+@samp{-w}.  All other options are GNU extensions and are not repeated in
+this enumeration.  Moreover, some options have a slightly different
+meaning when GNU extensions are enabled, as explained below.
+
+@item
+By default, concordance output is not formatted for @code{troff} or
+@code{nroff}.  It is rather formatted for a dumb terminal.  @code{troff}
+or @code{nroff} output may still be selected through option @code{-O}.
+
+@item
+Unless @code{-R} option is used, the maximum reference width is
+subtracted from the total output line width.  With GNU extensions
+disabled, width of references is not taken into account in the output
+line width computations.
+
+@item
+All 256 characters, even @kbd{NUL}s, are always read and processed from
+input file with no adverse effect, even if GNU extensions are disabled.
+However, System V @code{ptx} does not accept 8-bit characters, a few
+control characters are rejected, and the tilda @kbd{~} is condemned.
+
+@item
+Input line length is only limited by available memory, even if GNU
+extensions are disabled.  However, System V @code{ptx} processes only
+the first 200 characters in each line.
+
+@item
+The break (non-word) characters default to be every character except all
+letters of the underlying character set, diacriticized or not.  When GNU
+extensions are disabled, the break characters default to space, tab and
+newline only.
+
+@item
+The program makes better use of output line width.  If GNU extensions
+are disabled, the program rather tries to imitate System V @code{ptx},
+but still, there are some slight disposition glitches this program does
+not completely reproduce.
+
+@item
+The user can specify both an Ignore file and an Only file.  This is not
+allowed with System V @code{ptx}.
+
+@end itemize
+
+@bye
--- a/gnu/usr.bin/ptx/ptx.c
+++ b/gnu/usr.bin/ptx/ptx.c
@ -101,7 +101,7 @@ extern int errno;

 #include "bumpalloc.h"
 #include "diacrit.h"
-#include "regex.h"
+#include "gnuregex.h"

 #ifndef __STDC__
 void *xmalloc ();