commit 717f769197ece6814ebfcb7e3e7066bbdec2106b Author: Nate Williams Date: Tue Jun 29 06:04:45 1993 +0000 Gnu e?grep 1.6 Notes: svn path=/vendor/misc-GNU/dist1/; revision=61 svn path=/vendor/misc-GNU/GREP_1_6/; revision=63; tag=vendor/misc-GNU/grep/1.6 diff --git a/gnu/usr.bin/grep/COPYING b/gnu/usr.bin/grep/COPYING new file mode 100644 index 000000000000..a43ea2126fb6 --- /dev/null +++ b/gnu/usr.bin/grep/COPYING @@ -0,0 +1,339 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 675 Mass Ave, Cambridge, MA 02139, USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + Appendix: How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) 19yy + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) 19yy name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. diff --git a/gnu/usr.bin/grep/Makefile b/gnu/usr.bin/grep/Makefile new file mode 100644 index 000000000000..092e523ebe68 --- /dev/null +++ b/gnu/usr.bin/grep/Makefile @@ -0,0 +1,7 @@ + +PROG= grep +SRCS= dfa.c regex.o grep.o +CFLAGS+=-DSTDC_HEADERS=1 -DHAVE_UNISTD_H=1 +MLINKS= grep.1 egrep.1 + +.include diff --git a/gnu/usr.bin/grep/README b/gnu/usr.bin/grep/README new file mode 100644 index 000000000000..27f5bae42d8d --- /dev/null +++ b/gnu/usr.bin/grep/README @@ -0,0 +1,70 @@ +This README documents GNU e?grep version 1.6. All bugs reported for +previous versions have been fixed. + +See the file INSTALL for compilation and installation instructions. + +Send bug reports to bug-gnu-utils@prep.ai.mit.edu. + +GNU e?grep is provided "as is" with no warranty. The exact terms +under which you may use and (re)distribute this program are detailed +in the GNU General Public License, in the file COPYING. + +GNU e?grep is based on a fast lazy-state deterministic matcher (about +twice as fast as stock Unix egrep) hybridized with a Boyer-Moore-Gosper +search for a fixed string that eliminates impossible text from being +considered by the full regexp matcher without necessarily having to +look at every character. The result is typically many times faster +than Unix grep or egrep. (Regular expressions containing backreferencing +may run more slowly, however.) + +GNU e?grep is brought to you by the efforts of several people: + + Mike Haertel wrote the deterministic regexp code and the bulk + of the program. + + James A. Woods is responsible for the hybridized search strategy + of using Boyer-Moore-Gosper fixed-string search as a filter + before calling the general regexp matcher. + + Arthur David Olson contributed code that finds fixed strings for + the aforementioned BMG search for a large class of regexps. + + Richard Stallman wrote the backtracking regexp matcher that is + used for \ backreferences, as well as the getopt that + is provided for 4.2BSD sites. The backtracking matcher was + originally written for GNU Emacs. + + D. A. Gwyn wrote the C alloca emulation that is provided so + System V machines can run this program. (Alloca is used only + by RMS' backtracking matcher, and then only rarely, so there + is no loss if your machine doesn't have a "real" alloca.) + + Scott Anderson and Henry Spencer designed the regression tests + used in the "regress" script. + + Paul Placeway wrote the manual page, based on this README. + +If you are interested in improving this program, you may wish to try +any of the following: + +1. Replace the fast search loop with a faster search loop. + There are several things that could be improved, the most notable + of which would be to calculate a minimal delta2 to use. + +2. Make backreferencing \ faster. Right now, backreferencing is + handled by calling the Emacs backtracking matcher to verify the partial + match. This is slow; if the DFA routines could handle backreferencing + themselves a speedup on the order of three to four times might occur + in those cases where the backtracking matcher is called to verify nearly + every line. Also, some portability problems due to the inclusion of the + emacs matcher would be solved because it could then be eliminated. + Note that expressions with backreferencing are not true regular + expressions, and thus are not equivalent to any DFA. So this is hard. + +3. Handle POSIX style regexps. I'm not sure if this could be called an + improvement; some of the things on regexps in the POSIX draft I have + seen are pretty sickening. But it would be useful in the interests of + conforming to the standard. + +4. Replace the main driver program grep.c with the much cleaner main driver + program used in GNU fgrep. diff --git a/gnu/usr.bin/grep/dfa.c b/gnu/usr.bin/grep/dfa.c new file mode 100644 index 000000000000..08b383d04bd7 --- /dev/null +++ b/gnu/usr.bin/grep/dfa.c @@ -0,0 +1,2276 @@ +/* dfa.c - determinisitic extended regexp routines for GNU + Copyright (C) 1988 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +/* Written June, 1988 by Mike Haertel + Modified July, 1988 by Arthur David Olson to assist BMG speedups */ + +#include +#include + +#if defined(USG) || defined(STDC_HEADERS) +#include +#ifndef index +#define index strchr +#endif +#else +#include +#endif + +#include "dfa.h" + +#if __STDC__ +typedef void *ptr_t; +#else +typedef char *ptr_t; +#endif + +static void regmust(); + +static ptr_t +xcalloc(n, s) + int n; + size_t s; +{ + ptr_t r = calloc(n, s); + + if (!r) + regerror("Memory exhausted"); + return r; +} + +ptr_t /* Not static, so alloca.o can use it. */ +xmalloc(n) + size_t n; +{ + ptr_t r = malloc(n); + + assert(n != 0); + if (!r) + regerror("Memory exhausted"); + return r; +} + +static ptr_t +xrealloc(p, n) + ptr_t p; + size_t n; +{ + ptr_t r = realloc(p, n); + + assert(n != 0); + if (!r) + regerror("Memory exhausted"); + return r; +} + +#define CALLOC(p, t, n) ((p) = (t *) xcalloc((n), sizeof (t))) +#define MALLOC(p, t, n) ((p) = (t *) xmalloc((n) * sizeof (t))) +#define REALLOC(p, t, n) ((p) = (t *) xrealloc((ptr_t) (p), (n) * sizeof (t))) + +/* Reallocate an array of type t if nalloc is too small for index. */ +#define REALLOC_IF_NECESSARY(p, t, nalloc, index) \ + if ((index) >= (nalloc)) \ + { \ + while ((index) >= (nalloc)) \ + (nalloc) *= 2; \ + REALLOC(p, t, nalloc); \ + } + +#ifdef DEBUG +#include + +static void +prtok(t) + _token t; +{ + char *s; + + if (t < 0) + fprintf(stderr, "END"); + else if (t < _NOTCHAR) + fprintf(stderr, "%c", t); + else + { + switch (t) + { + case _EMPTY: s = "EMPTY"; break; + case _BACKREF: s = "BACKREF"; break; + case _BEGLINE: s = "BEGLINE"; break; + case _ALLBEGLINE: s = "ALLBEGLINE"; break; + case _ENDLINE: s = "ENDLINE"; break; + case _ALLENDLINE: s = "ALLENDLINE"; break; + case _BEGWORD: s = "BEGWORD"; break; + case _ENDWORD: s = "ENDWORD"; break; + case _LIMWORD: s = "LIMWORD"; break; + case _NOTLIMWORD: s = "NOTLIMWORD"; break; + case _QMARK: s = "QMARK"; break; + case _STAR: s = "STAR"; break; + case _PLUS: s = "PLUS"; break; + case _CAT: s = "CAT"; break; + case _OR: s = "OR"; break; + case _LPAREN: s = "LPAREN"; break; + case _RPAREN: s = "RPAREN"; break; + default: s = "SET"; break; + } + fprintf(stderr, "%s", s); + } +} +#endif /* DEBUG */ + +/* Stuff pertaining to charsets. */ + +static int +tstbit(b, c) + int b; + _charset c; +{ + return c[b / INTBITS] & 1 << b % INTBITS; +} + +static void +setbit(b, c) + int b; + _charset c; +{ + c[b / INTBITS] |= 1 << b % INTBITS; +} + +static void +clrbit(b, c) + int b; + _charset c; +{ + c[b / INTBITS] &= ~(1 << b % INTBITS); +} + +static void +copyset(src, dst) + const _charset src; + _charset dst; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + dst[i] = src[i]; +} + +static void +zeroset(s) + _charset s; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + s[i] = 0; +} + +static void +notset(s) + _charset s; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + s[i] = ~s[i]; +} + +static int +equal(s1, s2) + const _charset s1; + const _charset s2; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + if (s1[i] != s2[i]) + return 0; + return 1; +} + +/* A pointer to the current regexp is kept here during parsing. */ +static struct regexp *reg; + +/* Find the index of charset s in reg->charsets, or allocate a new charset. */ +static int +charset_index(s) + const _charset s; +{ + int i; + + for (i = 0; i < reg->cindex; ++i) + if (equal(s, reg->charsets[i])) + return i; + REALLOC_IF_NECESSARY(reg->charsets, _charset, reg->calloc, reg->cindex); + ++reg->cindex; + copyset(s, reg->charsets[i]); + return i; +} + +/* Syntax bits controlling the behavior of the lexical analyzer. */ +static syntax_bits, syntax_bits_set; + +/* Flag for case-folding letters into sets. */ +static case_fold; + +/* Entry point to set syntax options. */ +void +regsyntax(bits, fold) + int bits; + int fold; +{ + syntax_bits_set = 1; + syntax_bits = bits; + case_fold = fold; +} + +/* Lexical analyzer. */ +static const char *lexstart; /* Pointer to beginning of input string. */ +static const char *lexptr; /* Pointer to next input character. */ +static lexleft; /* Number of characters remaining. */ +static caret_allowed; /* True if backward context allows ^ + (meaningful only if RE_CONTEXT_INDEP_OPS + is turned off). */ +static closure_allowed; /* True if backward context allows closures + (meaningful only if RE_CONTEXT_INDEP_OPS + is turned off). */ + +/* Note that characters become unsigned here. */ +#define FETCH(c, eoferr) \ + { \ + if (! lexleft) \ + if (eoferr) \ + regerror(eoferr); \ + else \ + return _END; \ + (c) = (unsigned char) *lexptr++; \ + --lexleft; \ + } + +static _token +lex() +{ + _token c, c2; + int invert; + _charset cset; + + FETCH(c, (char *) 0); + switch (c) + { + case '^': + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) + && (!caret_allowed || + (syntax_bits & RE_TIGHT_VBAR) && lexptr - 1 != lexstart)) + goto normal_char; + caret_allowed = 0; + return syntax_bits & RE_TIGHT_VBAR ? _ALLBEGLINE : _BEGLINE; + + case '$': + if (syntax_bits & RE_CONTEXT_INDEP_OPS || !lexleft + || (! (syntax_bits & RE_TIGHT_VBAR) + && ((syntax_bits & RE_NO_BK_PARENS + ? lexleft > 0 && *lexptr == ')' + : lexleft > 1 && *lexptr == '\\' && lexptr[1] == ')') + || (syntax_bits & RE_NO_BK_VBAR + ? lexleft > 0 && *lexptr == '|' + : lexleft > 1 && *lexptr == '\\' && lexptr[1] == '|')))) + return syntax_bits & RE_TIGHT_VBAR ? _ALLENDLINE : _ENDLINE; + goto normal_char; + + case '\\': + FETCH(c, "Unfinished \\ quote"); + switch (c) + { + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + caret_allowed = 0; + closure_allowed = 1; + return _BACKREF; + + case '<': + caret_allowed = 0; + return _BEGWORD; + + case '>': + caret_allowed = 0; + return _ENDWORD; + + case 'b': + caret_allowed = 0; + return _LIMWORD; + + case 'B': + caret_allowed = 0; + return _NOTLIMWORD; + + case 'w': + case 'W': + zeroset(cset); + for (c2 = 0; c2 < _NOTCHAR; ++c2) + if (ISALNUM(c2)) + setbit(c2, cset); + if (c == 'W') + notset(cset); + caret_allowed = 0; + closure_allowed = 1; + return _SET + charset_index(cset); + + case '?': + if (syntax_bits & RE_BK_PLUS_QM) + goto qmark; + goto normal_char; + + case '+': + if (syntax_bits & RE_BK_PLUS_QM) + goto plus; + goto normal_char; + + case '|': + if (! (syntax_bits & RE_NO_BK_VBAR)) + goto or; + goto normal_char; + + case '(': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto lparen; + goto normal_char; + + case ')': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto rparen; + goto normal_char; + + default: + goto normal_char; + } + + case '?': + if (syntax_bits & RE_BK_PLUS_QM) + goto normal_char; + qmark: + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed) + goto normal_char; + return _QMARK; + + case '*': + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed) + goto normal_char; + return _STAR; + + case '+': + if (syntax_bits & RE_BK_PLUS_QM) + goto normal_char; + plus: + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed) + goto normal_char; + return _PLUS; + + case '|': + if (! (syntax_bits & RE_NO_BK_VBAR)) + goto normal_char; + or: + caret_allowed = 1; + closure_allowed = 0; + return _OR; + + case '\n': + if (! (syntax_bits & RE_NEWLINE_OR)) + goto normal_char; + goto or; + + case '(': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto normal_char; + lparen: + caret_allowed = 1; + closure_allowed = 0; + return _LPAREN; + + case ')': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto normal_char; + rparen: + caret_allowed = 0; + closure_allowed = 1; + return _RPAREN; + + case '.': + zeroset(cset); + notset(cset); + clrbit('\n', cset); + caret_allowed = 0; + closure_allowed = 1; + return _SET + charset_index(cset); + + case '[': + zeroset(cset); + FETCH(c, "Unbalanced ["); + if (c == '^') + { + FETCH(c, "Unbalanced ["); + invert = 1; + } + else + invert = 0; + do + { + FETCH(c2, "Unbalanced ["); + if (c2 == '-') + { + FETCH(c2, "Unbalanced ["); + while (c <= c2) + { + setbit(c, cset); + if (case_fold) + if (ISUPPER(c)) + setbit(tolower(c), cset); + else if (ISLOWER(c)) + setbit(toupper(c), cset); + ++c; + } + FETCH(c, "Unbalanced ["); + } + else + { + setbit(c, cset); + if (case_fold) + if (ISUPPER(c)) + setbit(tolower(c), cset); + else if (ISLOWER(c)) + setbit(toupper(c), cset); + c = c2; + } + } + while (c != ']'); + if (invert) + notset(cset); + caret_allowed = 0; + closure_allowed = 1; + return _SET + charset_index(cset); + + default: + normal_char: + caret_allowed = 0; + closure_allowed = 1; + if (case_fold && ISALPHA(c)) + { + zeroset(cset); + if (isupper(c)) + c = tolower(c); + setbit(c, cset); + setbit(toupper(c), cset); + return _SET + charset_index(cset); + } + return c; + } +} + +/* Recursive descent parser for regular expressions. */ + +static _token tok; /* Lookahead token. */ +static depth; /* Current depth of a hypothetical stack + holding deferred productions. This is + used to determine the depth that will be + required of the real stack later on in + reganalyze(). */ + +/* Add the given token to the parse tree, maintaining the depth count and + updating the maximum depth if necessary. */ +static void +addtok(t) + _token t; +{ + REALLOC_IF_NECESSARY(reg->tokens, _token, reg->talloc, reg->tindex); + reg->tokens[reg->tindex++] = t; + + switch (t) + { + case _QMARK: + case _STAR: + case _PLUS: + break; + + case _CAT: + case _OR: + --depth; + break; + + default: + ++reg->nleaves; + case _EMPTY: + ++depth; + break; + } + if (depth > reg->depth) + reg->depth = depth; +} + +/* The grammar understood by the parser is as follows. + + start: + regexp + _ALLBEGLINE regexp + regexp _ALLENDLINE + _ALLBEGLINE regexp _ALLENDLINE + + regexp: + regexp _OR branch + branch + + branch: + branch closure + closure + + closure: + closure _QMARK + closure _STAR + closure _PLUS + atom + + atom: + + _SET + _BACKREF + _BEGLINE + _ENDLINE + _BEGWORD + _ENDWORD + _LIMWORD + _NOTLIMWORD + + + The parser builds a parse tree in postfix form in an array of tokens. */ + +#if __STDC__ +static void regexp(void); +#else +static void regexp(); +#endif + +static void +atom() +{ + if (tok >= 0 && tok < _NOTCHAR || tok >= _SET || tok == _BACKREF + || tok == _BEGLINE || tok == _ENDLINE || tok == _BEGWORD + || tok == _ENDWORD || tok == _LIMWORD || tok == _NOTLIMWORD) + { + addtok(tok); + tok = lex(); + } + else if (tok == _LPAREN) + { + tok = lex(); + regexp(); + if (tok != _RPAREN) + regerror("Unbalanced ("); + tok = lex(); + } + else + addtok(_EMPTY); +} + +static void +closure() +{ + atom(); + while (tok == _QMARK || tok == _STAR || tok == _PLUS) + { + addtok(tok); + tok = lex(); + } +} + +static void +branch() +{ + closure(); + while (tok != _RPAREN && tok != _OR && tok != _ALLENDLINE && tok >= 0) + { + closure(); + addtok(_CAT); + } +} + +static void +regexp() +{ + branch(); + while (tok == _OR) + { + tok = lex(); + branch(); + addtok(_OR); + } +} + +/* Main entry point for the parser. S is a string to be parsed, len is the + length of the string, so s can include NUL characters. R is a pointer to + the struct regexp to parse into. */ +void +regparse(s, len, r) + const char *s; + size_t len; + struct regexp *r; +{ + reg = r; + lexstart = lexptr = s; + lexleft = len; + caret_allowed = 1; + closure_allowed = 0; + + if (! syntax_bits_set) + regerror("No syntax specified"); + + tok = lex(); + depth = r->depth; + + if (tok == _ALLBEGLINE) + { + addtok(_BEGLINE); + tok = lex(); + regexp(); + addtok(_CAT); + } + else + regexp(); + + if (tok == _ALLENDLINE) + { + addtok(_ENDLINE); + addtok(_CAT); + tok = lex(); + } + + if (tok != _END) + regerror("Unbalanced )"); + + addtok(_END - r->nregexps); + addtok(_CAT); + + if (r->nregexps) + addtok(_OR); + + ++r->nregexps; +} + +/* Some primitives for operating on sets of positions. */ + +/* Copy one set to another; the destination must be large enough. */ +static void +copy(src, dst) + const _position_set *src; + _position_set *dst; +{ + int i; + + for (i = 0; i < src->nelem; ++i) + dst->elems[i] = src->elems[i]; + dst->nelem = src->nelem; +} + +/* Insert a position in a set. Position sets are maintained in sorted + order according to index. If position already exists in the set with + the same index then their constraints are logically or'd together. + S->elems must point to an array large enough to hold the resulting set. */ +static void +insert(p, s) + _position p; + _position_set *s; +{ + int i; + _position t1, t2; + + for (i = 0; i < s->nelem && p.index < s->elems[i].index; ++i) + ; + if (i < s->nelem && p.index == s->elems[i].index) + s->elems[i].constraint |= p.constraint; + else + { + t1 = p; + ++s->nelem; + while (i < s->nelem) + { + t2 = s->elems[i]; + s->elems[i++] = t1; + t1 = t2; + } + } +} + +/* Merge two sets of positions into a third. The result is exactly as if + the positions of both sets were inserted into an initially empty set. */ +static void +merge(s1, s2, m) + _position_set *s1; + _position_set *s2; + _position_set *m; +{ + int i = 0, j = 0; + + m->nelem = 0; + while (i < s1->nelem && j < s2->nelem) + if (s1->elems[i].index > s2->elems[j].index) + m->elems[m->nelem++] = s1->elems[i++]; + else if (s1->elems[i].index < s2->elems[j].index) + m->elems[m->nelem++] = s2->elems[j++]; + else + { + m->elems[m->nelem] = s1->elems[i++]; + m->elems[m->nelem++].constraint |= s2->elems[j++].constraint; + } + while (i < s1->nelem) + m->elems[m->nelem++] = s1->elems[i++]; + while (j < s2->nelem) + m->elems[m->nelem++] = s2->elems[j++]; +} + +/* Delete a position from a set. */ +static void +delete(p, s) + _position p; + _position_set *s; +{ + int i; + + for (i = 0; i < s->nelem; ++i) + if (p.index == s->elems[i].index) + break; + if (i < s->nelem) + for (--s->nelem; i < s->nelem; ++i) + s->elems[i] = s->elems[i + 1]; +} + +/* Find the index of the state corresponding to the given position set with + the given preceding context, or create a new state if there is no such + state. Newline and letter tell whether we got here on a newline or + letter, respectively. */ +static int +state_index(r, s, newline, letter) + struct regexp *r; + _position_set *s; + int newline; + int letter; +{ + int hash = 0; + int constraint; + int i, j; + + newline = newline ? 1 : 0; + letter = letter ? 1 : 0; + + for (i = 0; i < s->nelem; ++i) + hash ^= s->elems[i].index + s->elems[i].constraint; + + /* Try to find a state that exactly matches the proposed one. */ + for (i = 0; i < r->sindex; ++i) + { + if (hash != r->states[i].hash || s->nelem != r->states[i].elems.nelem + || newline != r->states[i].newline || letter != r->states[i].letter) + continue; + for (j = 0; j < s->nelem; ++j) + if (s->elems[j].constraint + != r->states[i].elems.elems[j].constraint + || s->elems[j].index != r->states[i].elems.elems[j].index) + break; + if (j == s->nelem) + return i; + } + + /* We'll have to create a new state. */ + REALLOC_IF_NECESSARY(r->states, _dfa_state, r->salloc, r->sindex); + r->states[i].hash = hash; + MALLOC(r->states[i].elems.elems, _position, s->nelem); + copy(s, &r->states[i].elems); + r->states[i].newline = newline; + r->states[i].letter = letter; + r->states[i].backref = 0; + r->states[i].constraint = 0; + r->states[i].first_end = 0; + for (j = 0; j < s->nelem; ++j) + if (r->tokens[s->elems[j].index] < 0) + { + constraint = s->elems[j].constraint; + if (_SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 0) + || _SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 1) + || _SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 0) + || _SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 1)) + r->states[i].constraint |= constraint; + if (! r->states[i].first_end) + r->states[i].first_end = r->tokens[s->elems[j].index]; + } + else if (r->tokens[s->elems[j].index] == _BACKREF) + { + r->states[i].constraint = _NO_CONSTRAINT; + r->states[i].backref = 1; + } + + ++r->sindex; + + return i; +} + +/* Find the epsilon closure of a set of positions. If any position of the set + contains a symbol that matches the empty string in some context, replace + that position with the elements of its follow labeled with an appropriate + constraint. Repeat exhaustively until no funny positions are left. + S->elems must be large enough to hold the result. */ +void +epsclosure(s, r) + _position_set *s; + struct regexp *r; +{ + int i, j; + int *visited; + _position p, old; + + MALLOC(visited, int, r->tindex); + for (i = 0; i < r->tindex; ++i) + visited[i] = 0; + + for (i = 0; i < s->nelem; ++i) + if (r->tokens[s->elems[i].index] >= _NOTCHAR + && r->tokens[s->elems[i].index] != _BACKREF + && r->tokens[s->elems[i].index] < _SET) + { + old = s->elems[i]; + p.constraint = old.constraint; + delete(s->elems[i], s); + if (visited[old.index]) + { + --i; + continue; + } + visited[old.index] = 1; + switch (r->tokens[old.index]) + { + case _BEGLINE: + p.constraint &= _BEGLINE_CONSTRAINT; + break; + case _ENDLINE: + p.constraint &= _ENDLINE_CONSTRAINT; + break; + case _BEGWORD: + p.constraint &= _BEGWORD_CONSTRAINT; + break; + case _ENDWORD: + p.constraint &= _ENDWORD_CONSTRAINT; + break; + case _LIMWORD: + p.constraint &= _LIMWORD_CONSTRAINT; + break; + case _NOTLIMWORD: + p.constraint &= _NOTLIMWORD_CONSTRAINT; + break; + default: + break; + } + for (j = 0; j < r->follows[old.index].nelem; ++j) + { + p.index = r->follows[old.index].elems[j].index; + insert(p, s); + } + /* Force rescan to start at the beginning. */ + i = -1; + } + + free(visited); +} + +/* Perform bottom-up analysis on the parse tree, computing various functions. + Note that at this point, we're pretending constructs like \< are real + characters rather than constraints on what can follow them. + + Nullable: A node is nullable if it is at the root of a regexp that can + match the empty string. + * _EMPTY leaves are nullable. + * No other leaf is nullable. + * A _QMARK or _STAR node is nullable. + * A _PLUS node is nullable if its argument is nullable. + * A _CAT node is nullable if both its arguments are nullable. + * An _OR node is nullable if either argument is nullable. + + Firstpos: The firstpos of a node is the set of positions (nonempty leaves) + that could correspond to the first character of a string matching the + regexp rooted at the given node. + * _EMPTY leaves have empty firstpos. + * The firstpos of a nonempty leaf is that leaf itself. + * The firstpos of a _QMARK, _STAR, or _PLUS node is the firstpos of its + argument. + * The firstpos of a _CAT node is the firstpos of the left argument, union + the firstpos of the right if the left argument is nullable. + * The firstpos of an _OR node is the union of firstpos of each argument. + + Lastpos: The lastpos of a node is the set of positions that could + correspond to the last character of a string matching the regexp at + the given node. + * _EMPTY leaves have empty lastpos. + * The lastpos of a nonempty leaf is that leaf itself. + * The lastpos of a _QMARK, _STAR, or _PLUS node is the lastpos of its + argument. + * The lastpos of a _CAT node is the lastpos of its right argument, union + the lastpos of the left if the right argument is nullable. + * The lastpos of an _OR node is the union of the lastpos of each argument. + + Follow: The follow of a position is the set of positions that could + correspond to the character following a character matching the node in + a string matching the regexp. At this point we consider special symbols + that match the empty string in some context to be just normal characters. + Later, if we find that a special symbol is in a follow set, we will + replace it with the elements of its follow, labeled with an appropriate + constraint. + * Every node in the firstpos of the argument of a _STAR or _PLUS node is in + the follow of every node in the lastpos. + * Every node in the firstpos of the second argument of a _CAT node is in + the follow of every node in the lastpos of the first argument. + + Because of the postfix representation of the parse tree, the depth-first + analysis is conveniently done by a linear scan with the aid of a stack. + Sets are stored as arrays of the elements, obeying a stack-like allocation + scheme; the number of elements in each set deeper in the stack can be + used to determine the address of a particular set's array. */ +void +reganalyze(r, searchflag) + struct regexp *r; + int searchflag; +{ + int *nullable; /* Nullable stack. */ + int *nfirstpos; /* Element count stack for firstpos sets. */ + _position *firstpos; /* Array where firstpos elements are stored. */ + int *nlastpos; /* Element count stack for lastpos sets. */ + _position *lastpos; /* Array where lastpos elements are stored. */ + int *nalloc; /* Sizes of arrays allocated to follow sets. */ + _position_set tmp; /* Temporary set for merging sets. */ + _position_set merged; /* Result of merging sets. */ + int wants_newline; /* True if some position wants newline info. */ + int *o_nullable; + int *o_nfirst, *o_nlast; + _position *o_firstpos, *o_lastpos; + int i, j; + _position *pos; + +#ifdef DEBUG + fprintf(stderr, "reganalyze:\n"); + for (i = 0; i < r->tindex; ++i) + { + fprintf(stderr, " %d:", i); + prtok(r->tokens[i]); + } + putc('\n', stderr); +#endif + + r->searchflag = searchflag; + + MALLOC(nullable, int, r->depth); + o_nullable = nullable; + MALLOC(nfirstpos, int, r->depth); + o_nfirst = nfirstpos; + MALLOC(firstpos, _position, r->nleaves); + o_firstpos = firstpos, firstpos += r->nleaves; + MALLOC(nlastpos, int, r->depth); + o_nlast = nlastpos; + MALLOC(lastpos, _position, r->nleaves); + o_lastpos = lastpos, lastpos += r->nleaves; + MALLOC(nalloc, int, r->tindex); + for (i = 0; i < r->tindex; ++i) + nalloc[i] = 0; + MALLOC(merged.elems, _position, r->nleaves); + + CALLOC(r->follows, _position_set, r->tindex); + + for (i = 0; i < r->tindex; ++i) +#ifdef DEBUG + { /* Nonsyntactic #ifdef goo... */ +#endif + switch (r->tokens[i]) + { + case _EMPTY: + /* The empty set is nullable. */ + *nullable++ = 1; + + /* The firstpos and lastpos of the empty leaf are both empty. */ + *nfirstpos++ = *nlastpos++ = 0; + break; + + case _STAR: + case _PLUS: + /* Every element in the firstpos of the argument is in the follow + of every element in the lastpos. */ + tmp.nelem = nfirstpos[-1]; + tmp.elems = firstpos; + pos = lastpos; + for (j = 0; j < nlastpos[-1]; ++j) + { + merge(&tmp, &r->follows[pos[j].index], &merged); + REALLOC_IF_NECESSARY(r->follows[pos[j].index].elems, _position, + nalloc[pos[j].index], merged.nelem - 1); + copy(&merged, &r->follows[pos[j].index]); + } + + case _QMARK: + /* A _QMARK or _STAR node is automatically nullable. */ + if (r->tokens[i] != _PLUS) + nullable[-1] = 1; + break; + + case _CAT: + /* Every element in the firstpos of the second argument is in the + follow of every element in the lastpos of the first argument. */ + tmp.nelem = nfirstpos[-1]; + tmp.elems = firstpos; + pos = lastpos + nlastpos[-1]; + for (j = 0; j < nlastpos[-2]; ++j) + { + merge(&tmp, &r->follows[pos[j].index], &merged); + REALLOC_IF_NECESSARY(r->follows[pos[j].index].elems, _position, + nalloc[pos[j].index], merged.nelem - 1); + copy(&merged, &r->follows[pos[j].index]); + } + + /* The firstpos of a _CAT node is the firstpos of the first argument, + union that of the second argument if the first is nullable. */ + if (nullable[-2]) + nfirstpos[-2] += nfirstpos[-1]; + else + firstpos += nfirstpos[-1]; + --nfirstpos; + + /* The lastpos of a _CAT node is the lastpos of the second argument, + union that of the first argument if the second is nullable. */ + if (nullable[-1]) + nlastpos[-2] += nlastpos[-1]; + else + { + pos = lastpos + nlastpos[-2]; + for (j = nlastpos[-1] - 1; j >= 0; --j) + pos[j] = lastpos[j]; + lastpos += nlastpos[-2]; + nlastpos[-2] = nlastpos[-1]; + } + --nlastpos; + + /* A _CAT node is nullable if both arguments are nullable. */ + nullable[-2] = nullable[-1] && nullable[-2]; + --nullable; + break; + + case _OR: + /* The firstpos is the union of the firstpos of each argument. */ + nfirstpos[-2] += nfirstpos[-1]; + --nfirstpos; + + /* The lastpos is the union of the lastpos of each argument. */ + nlastpos[-2] += nlastpos[-1]; + --nlastpos; + + /* An _OR node is nullable if either argument is nullable. */ + nullable[-2] = nullable[-1] || nullable[-2]; + --nullable; + break; + + default: + /* Anything else is a nonempty position. (Note that special + constructs like \< are treated as nonempty strings here; + an "epsilon closure" effectively makes them nullable later. + Backreferences have to get a real position so we can detect + transitions on them later. But they are nullable. */ + *nullable++ = r->tokens[i] == _BACKREF; + + /* This position is in its own firstpos and lastpos. */ + *nfirstpos++ = *nlastpos++ = 1; + --firstpos, --lastpos; + firstpos->index = lastpos->index = i; + firstpos->constraint = lastpos->constraint = _NO_CONSTRAINT; + + /* Allocate the follow set for this position. */ + nalloc[i] = 1; + MALLOC(r->follows[i].elems, _position, nalloc[i]); + break; + } +#ifdef DEBUG + /* ... balance the above nonsyntactic #ifdef goo... */ + fprintf(stderr, "node %d:", i); + prtok(r->tokens[i]); + putc('\n', stderr); + fprintf(stderr, nullable[-1] ? " nullable: yes\n" : " nullable: no\n"); + fprintf(stderr, " firstpos:"); + for (j = nfirstpos[-1] - 1; j >= 0; --j) + { + fprintf(stderr, " %d:", firstpos[j].index); + prtok(r->tokens[firstpos[j].index]); + } + fprintf(stderr, "\n lastpos:"); + for (j = nlastpos[-1] - 1; j >= 0; --j) + { + fprintf(stderr, " %d:", lastpos[j].index); + prtok(r->tokens[lastpos[j].index]); + } + putc('\n', stderr); + } +#endif + + /* For each follow set that is the follow set of a real position, replace + it with its epsilon closure. */ + for (i = 0; i < r->tindex; ++i) + if (r->tokens[i] < _NOTCHAR || r->tokens[i] == _BACKREF + || r->tokens[i] >= _SET) + { +#ifdef DEBUG + fprintf(stderr, "follows(%d:", i); + prtok(r->tokens[i]); + fprintf(stderr, "):"); + for (j = r->follows[i].nelem - 1; j >= 0; --j) + { + fprintf(stderr, " %d:", r->follows[i].elems[j].index); + prtok(r->tokens[r->follows[i].elems[j].index]); + } + putc('\n', stderr); +#endif + copy(&r->follows[i], &merged); + epsclosure(&merged, r); + if (r->follows[i].nelem < merged.nelem) + REALLOC(r->follows[i].elems, _position, merged.nelem); + copy(&merged, &r->follows[i]); + } + + /* Get the epsilon closure of the firstpos of the regexp. The result will + be the set of positions of state 0. */ + merged.nelem = 0; + for (i = 0; i < nfirstpos[-1]; ++i) + insert(firstpos[i], &merged); + epsclosure(&merged, r); + + /* Check if any of the positions of state 0 will want newline context. */ + wants_newline = 0; + for (i = 0; i < merged.nelem; ++i) + if (_PREV_NEWLINE_DEPENDENT(merged.elems[i].constraint)) + wants_newline = 1; + + /* Build the initial state. */ + r->salloc = 1; + r->sindex = 0; + MALLOC(r->states, _dfa_state, r->salloc); + state_index(r, &merged, wants_newline, 0); + + free(o_nullable); + free(o_nfirst); + free(o_firstpos); + free(o_nlast); + free(o_lastpos); + free(nalloc); + free(merged.elems); +} + +/* Find, for each character, the transition out of state s of r, and store + it in the appropriate slot of trans. + + We divide the positions of s into groups (positions can appear in more + than one group). Each group is labeled with a set of characters that + every position in the group matches (taking into account, if necessary, + preceding context information of s). For each group, find the union + of the its elements' follows. This set is the set of positions of the + new state. For each character in the group's label, set the transition + on this character to be to a state corresponding to the set's positions, + and its associated backward context information, if necessary. + + If we are building a searching matcher, we include the positions of state + 0 in every state. + + The collection of groups is constructed by building an equivalence-class + partition of the positions of s. + + For each position, find the set of characters C that it matches. Eliminate + any characters from C that fail on grounds of backward context. + + Search through the groups, looking for a group whose label L has nonempty + intersection with C. If L - C is nonempty, create a new group labeled + L - C and having the same positions as the current group, and set L to + the intersection of L and C. Insert the position in this group, set + C = C - L, and resume scanning. + + If after comparing with every group there are characters remaining in C, + create a new group labeled with the characters of C and insert this + position in that group. */ +void +regstate(s, r, trans) + int s; + struct regexp *r; + int trans[]; +{ + _position_set grps[_NOTCHAR]; /* As many as will ever be needed. */ + _charset labels[_NOTCHAR]; /* Labels corresponding to the groups. */ + int ngrps = 0; /* Number of groups actually used. */ + _position pos; /* Current position being considered. */ + _charset matches; /* Set of matching characters. */ + int matchesf; /* True if matches is nonempty. */ + _charset intersect; /* Intersection with some label set. */ + int intersectf; /* True if intersect is nonempty. */ + _charset leftovers; /* Stuff in the label that didn't match. */ + int leftoversf; /* True if leftovers is nonempty. */ + static _charset letters; /* Set of characters considered letters. */ + static _charset newline; /* Set of characters that aren't newline. */ + _position_set follows; /* Union of the follows of some group. */ + _position_set tmp; /* Temporary space for merging sets. */ + int state; /* New state. */ + int wants_newline; /* New state wants to know newline context. */ + int state_newline; /* New state on a newline transition. */ + int wants_letter; /* New state wants to know letter context. */ + int state_letter; /* New state on a letter transition. */ + static initialized; /* Flag for static initialization. */ + int i, j, k; + + /* Initialize the set of letters, if necessary. */ + if (! initialized) + { + initialized = 1; + for (i = 0; i < _NOTCHAR; ++i) + if (ISALNUM(i)) + setbit(i, letters); + setbit('\n', newline); + } + + zeroset(matches); + + for (i = 0; i < r->states[s].elems.nelem; ++i) + { + pos = r->states[s].elems.elems[i]; + if (r->tokens[pos.index] >= 0 && r->tokens[pos.index] < _NOTCHAR) + setbit(r->tokens[pos.index], matches); + else if (r->tokens[pos.index] >= _SET) + copyset(r->charsets[r->tokens[pos.index] - _SET], matches); + else + continue; + + /* Some characters may need to be eliminated from matches because + they fail in the current context. */ + if (pos.constraint != 0xFF) + { + if (! _MATCHES_NEWLINE_CONTEXT(pos.constraint, + r->states[s].newline, 1)) + clrbit('\n', matches); + if (! _MATCHES_NEWLINE_CONTEXT(pos.constraint, + r->states[s].newline, 0)) + for (j = 0; j < _CHARSET_INTS; ++j) + matches[j] &= newline[j]; + if (! _MATCHES_LETTER_CONTEXT(pos.constraint, + r->states[s].letter, 1)) + for (j = 0; j < _CHARSET_INTS; ++j) + matches[j] &= ~letters[j]; + if (! _MATCHES_LETTER_CONTEXT(pos.constraint, + r->states[s].letter, 0)) + for (j = 0; j < _CHARSET_INTS; ++j) + matches[j] &= letters[j]; + + /* If there are no characters left, there's no point in going on. */ + for (j = 0; j < _CHARSET_INTS && !matches[j]; ++j) + ; + if (j == _CHARSET_INTS) + continue; + } + + for (j = 0; j < ngrps; ++j) + { + /* If matches contains a single character only, and the current + group's label doesn't contain that character, go on to the + next group. */ + if (r->tokens[pos.index] >= 0 && r->tokens[pos.index] < _NOTCHAR + && !tstbit(r->tokens[pos.index], labels[j])) + continue; + + /* Check if this group's label has a nonempty intersection with + matches. */ + intersectf = 0; + for (k = 0; k < _CHARSET_INTS; ++k) + (intersect[k] = matches[k] & labels[j][k]) ? intersectf = 1 : 0; + if (! intersectf) + continue; + + /* It does; now find the set differences both ways. */ + leftoversf = matchesf = 0; + for (k = 0; k < _CHARSET_INTS; ++k) + { + /* Even an optimizing compiler can't know this for sure. */ + int match = matches[k], label = labels[j][k]; + + (leftovers[k] = ~match & label) ? leftoversf = 1 : 0; + (matches[k] = match & ~label) ? matchesf = 1 : 0; + } + + /* If there were leftovers, create a new group labeled with them. */ + if (leftoversf) + { + copyset(leftovers, labels[ngrps]); + copyset(intersect, labels[j]); + MALLOC(grps[ngrps].elems, _position, r->nleaves); + copy(&grps[j], &grps[ngrps]); + ++ngrps; + } + + /* Put the position in the current group. Note that there is no + reason to call insert() here. */ + grps[j].elems[grps[j].nelem++] = pos; + + /* If every character matching the current position has been + accounted for, we're done. */ + if (! matchesf) + break; + } + + /* If we've passed the last group, and there are still characters + unaccounted for, then we'll have to create a new group. */ + if (j == ngrps) + { + copyset(matches, labels[ngrps]); + zeroset(matches); + MALLOC(grps[ngrps].elems, _position, r->nleaves); + grps[ngrps].nelem = 1; + grps[ngrps].elems[0] = pos; + ++ngrps; + } + } + + MALLOC(follows.elems, _position, r->nleaves); + MALLOC(tmp.elems, _position, r->nleaves); + + /* If we are a searching matcher, the default transition is to a state + containing the positions of state 0, otherwise the default transition + is to fail miserably. */ + if (r->searchflag) + { + wants_newline = 0; + wants_letter = 0; + for (i = 0; i < r->states[0].elems.nelem; ++i) + { + if (_PREV_NEWLINE_DEPENDENT(r->states[0].elems.elems[i].constraint)) + wants_newline = 1; + if (_PREV_LETTER_DEPENDENT(r->states[0].elems.elems[i].constraint)) + wants_letter = 1; + } + copy(&r->states[0].elems, &follows); + state = state_index(r, &follows, 0, 0); + if (wants_newline) + state_newline = state_index(r, &follows, 1, 0); + else + state_newline = state; + if (wants_letter) + state_letter = state_index(r, &follows, 0, 1); + else + state_letter = state; + for (i = 0; i < _NOTCHAR; ++i) + if (i == '\n') + trans[i] = state_newline; + else if (ISALNUM(i)) + trans[i] = state_letter; + else + trans[i] = state; + } + else + for (i = 0; i < _NOTCHAR; ++i) + trans[i] = -1; + + for (i = 0; i < ngrps; ++i) + { + follows.nelem = 0; + + /* Find the union of the follows of the positions of the group. + This is a hideously inefficient loop. Fix it someday. */ + for (j = 0; j < grps[i].nelem; ++j) + for (k = 0; k < r->follows[grps[i].elems[j].index].nelem; ++k) + insert(r->follows[grps[i].elems[j].index].elems[k], &follows); + + /* If we are building a searching matcher, throw in the positions + of state 0 as well. */ + if (r->searchflag) + for (j = 0; j < r->states[0].elems.nelem; ++j) + insert(r->states[0].elems.elems[j], &follows); + + /* Find out if the new state will want any context information. */ + wants_newline = 0; + if (tstbit('\n', labels[i])) + for (j = 0; j < follows.nelem; ++j) + if (_PREV_NEWLINE_DEPENDENT(follows.elems[j].constraint)) + wants_newline = 1; + + wants_letter = 0; + for (j = 0; j < _CHARSET_INTS; ++j) + if (labels[i][j] & letters[j]) + break; + if (j < _CHARSET_INTS) + for (j = 0; j < follows.nelem; ++j) + if (_PREV_LETTER_DEPENDENT(follows.elems[j].constraint)) + wants_letter = 1; + + /* Find the state(s) corresponding to the union of the follows. */ + state = state_index(r, &follows, 0, 0); + if (wants_newline) + state_newline = state_index(r, &follows, 1, 0); + else + state_newline = state; + if (wants_letter) + state_letter = state_index(r, &follows, 0, 1); + else + state_letter = state; + + /* Set the transitions for each character in the current label. */ + for (j = 0; j < _CHARSET_INTS; ++j) + for (k = 0; k < INTBITS; ++k) + if (labels[i][j] & 1 << k) + { + int c = j * INTBITS + k; + + if (c == '\n') + trans[c] = state_newline; + else if (ISALNUM(c)) + trans[c] = state_letter; + else if (c < _NOTCHAR) + trans[c] = state; + } + } + + for (i = 0; i < ngrps; ++i) + free(grps[i].elems); + free(follows.elems); + free(tmp.elems); +} + +/* Some routines for manipulating a compiled regexp's transition tables. + Each state may or may not have a transition table; if it does, and it + is a non-accepting state, then r->trans[state] points to its table. + If it is an accepting state then r->fails[state] points to its table. + If it has no table at all, then r->trans[state] is NULL. + TODO: Improve this comment, get rid of the unnecessary redundancy. */ + +static void +build_state(s, r) + int s; + struct regexp *r; +{ + int *trans; /* The new transition table. */ + int i; + + /* Set an upper limit on the number of transition tables that will ever + exist at once. 1024 is arbitrary. The idea is that the frequently + used transition tables will be quickly rebuilt, whereas the ones that + were only needed once or twice will be cleared away. */ + if (r->trcount >= 1024) + { + for (i = 0; i < r->tralloc; ++i) + if (r->trans[i]) + { + free((ptr_t) r->trans[i]); + r->trans[i] = NULL; + } + else if (r->fails[i]) + { + free((ptr_t) r->fails[i]); + r->fails[i] = NULL; + } + r->trcount = 0; + } + + ++r->trcount; + + /* Set up the success bits for this state. */ + r->success[s] = 0; + if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 1, r->states[s].letter, 0, + s, *r)) + r->success[s] |= 4; + if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 0, r->states[s].letter, 1, + s, *r)) + r->success[s] |= 2; + if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 0, r->states[s].letter, 0, + s, *r)) + r->success[s] |= 1; + + MALLOC(trans, int, _NOTCHAR); + regstate(s, r, trans); + + /* Now go through the new transition table, and make sure that the trans + and fail arrays are allocated large enough to hold a pointer for the + largest state mentioned in the table. */ + for (i = 0; i < _NOTCHAR; ++i) + if (trans[i] >= r->tralloc) + { + int oldalloc = r->tralloc; + + while (trans[i] >= r->tralloc) + r->tralloc *= 2; + REALLOC(r->realtrans, int *, r->tralloc + 1); + r->trans = r->realtrans + 1; + REALLOC(r->fails, int *, r->tralloc); + REALLOC(r->success, int, r->tralloc); + REALLOC(r->newlines, int, r->tralloc); + while (oldalloc < r->tralloc) + { + r->trans[oldalloc] = NULL; + r->fails[oldalloc++] = NULL; + } + } + + /* Keep the newline transition in a special place so we can use it as + a sentinel. */ + r->newlines[s] = trans['\n']; + trans['\n'] = -1; + + if (ACCEPTING(s, *r)) + r->fails[s] = trans; + else + r->trans[s] = trans; +} + +static void +build_state_zero(r) + struct regexp *r; +{ + r->tralloc = 1; + r->trcount = 0; + CALLOC(r->realtrans, int *, r->tralloc + 1); + r->trans = r->realtrans + 1; + CALLOC(r->fails, int *, r->tralloc); + MALLOC(r->success, int, r->tralloc); + MALLOC(r->newlines, int, r->tralloc); + build_state(0, r); +} + +/* Search through a buffer looking for a match to the given struct regexp. + Find the first occurrence of a string matching the regexp in the buffer, + and the shortest possible version thereof. Return a pointer to the first + character after the match, or NULL if none is found. Begin points to + the beginning of the buffer, and end points to the first character after + its end. We store a newline in *end to act as a sentinel, so end had + better point somewhere valid. Newline is a flag indicating whether to + allow newlines to be in the matching string. If count is non- + NULL it points to a place we're supposed to increment every time we + see a newline. Finally, if backref is non-NULL it points to a place + where we're supposed to store a 1 if backreferencing happened and the + match needs to be verified by a backtracking matcher. Otherwise + we store a 0 in *backref. */ +char * +regexecute(r, begin, end, newline, count, backref) + struct regexp *r; + char *begin; + char *end; + int newline; + int *count; + int *backref; +{ + register s, s1, tmp; /* Current state. */ + register unsigned char *p; /* Current input character. */ + register **trans, *t; /* Copy of r->trans so it can be optimized + into a register. */ + static sbit[_NOTCHAR]; /* Table for anding with r->success. */ + static sbit_init; + + if (! sbit_init) + { + int i; + + sbit_init = 1; + for (i = 0; i < _NOTCHAR; ++i) + if (i == '\n') + sbit[i] = 4; + else if (ISALNUM(i)) + sbit[i] = 2; + else + sbit[i] = 1; + } + + if (! r->tralloc) + build_state_zero(r); + + s = 0; + p = (unsigned char *) begin; + trans = r->trans; + *end = '\n'; + + for (;;) + { + /* The dreaded inner loop. */ + if (t = trans[s]) + do + { + s1 = t[*p++]; + if (! (t = trans[s1])) + goto last_was_s; + s = t[*p++]; + } + while (t = trans[s]); + goto last_was_s1; + last_was_s: + tmp = s, s = s1, s1 = tmp; + last_was_s1: + + if (s >= 0 && p <= (unsigned char *) end && r->fails[s]) + { + if (r->success[s] & sbit[*p]) + { + if (backref) + if (r->states[s].backref) + *backref = 1; + else + *backref = 0; + return (char *) p; + } + + s1 = s; + s = r->fails[s][*p++]; + continue; + } + + /* If the previous character was a newline, count it. */ + if (count && (char *) p <= end && p[-1] == '\n') + ++*count; + + /* Check if we've run off the end of the buffer. */ + if ((char *) p >= end) + return NULL; + + if (s >= 0) + { + build_state(s, r); + trans = r->trans; + continue; + } + + if (p[-1] == '\n' && newline) + { + s = r->newlines[s1]; + continue; + } + + s = 0; + } +} + +/* Initialize the components of a regexp that the other routines don't + initialize for themselves. */ +void +reginit(r) + struct regexp *r; +{ + r->calloc = 1; + MALLOC(r->charsets, _charset, r->calloc); + r->cindex = 0; + + r->talloc = 1; + MALLOC(r->tokens, _token, r->talloc); + r->tindex = r->depth = r->nleaves = r->nregexps = 0; + + r->searchflag = 0; + r->tralloc = 0; +} + +/* Parse and analyze a single string of the given length. */ +void +regcompile(s, len, r, searchflag) + const char *s; + size_t len; + struct regexp *r; + int searchflag; +{ + if (case_fold) /* dummy folding in service of regmust() */ + { + char *copy; + int i; + + copy = malloc(len); + if (!copy) + regerror("out of memory"); + + /* This is a complete kludge and could potentially break + \ escapes . . . */ + case_fold = 0; + for (i = 0; i < len; ++i) + if (ISUPPER(s[i])) + copy[i] = tolower(s[i]); + else + copy[i] = s[i]; + + reginit(r); + r->mustn = 0; + r->must[0] = '\0'; + regparse(copy, len, r); + free(copy); + regmust(r); + reganalyze(r, searchflag); + case_fold = 1; + reginit(r); + regparse(s, len, r); + reganalyze(r, searchflag); + } + else + { + reginit(r); + regparse(s, len, r); + regmust(r); + reganalyze(r, searchflag); + } +} + +/* Free the storage held by the components of a regexp. */ +void +regfree(r) + struct regexp *r; +{ + int i; + + free((ptr_t) r->charsets); + free((ptr_t) r->tokens); + for (i = 0; i < r->sindex; ++i) + free((ptr_t) r->states[i].elems.elems); + free((ptr_t) r->states); + for (i = 0; i < r->tindex; ++i) + if (r->follows[i].elems) + free((ptr_t) r->follows[i].elems); + free((ptr_t) r->follows); + for (i = 0; i < r->tralloc; ++i) + if (r->trans[i]) + free((ptr_t) r->trans[i]); + else if (r->fails[i]) + free((ptr_t) r->fails[i]); + free((ptr_t) r->realtrans); + free((ptr_t) r->fails); + free((ptr_t) r->newlines); +} + +/* +Having found the postfix representation of the regular expression, +try to find a long sequence of characters that must appear in any line +containing the r.e. +Finding a "longest" sequence is beyond the scope here; +we take an easy way out and hope for the best. +(Take "(ab|a)b"--please.) + +We do a bottom-up calculation of sequences of characters that must appear +in matches of r.e.'s represented by trees rooted at the nodes of the postfix +representation: + sequences that must appear at the left of the match ("left") + sequences that must appear at the right of the match ("right") + lists of sequences that must appear somewhere in the match ("in") + sequences that must constitute the match ("is") +When we get to the root of the tree, we use one of the longest of its +calculated "in" sequences as our answer. The sequence we find is returned in +r->must (where "r" is the single argument passed to "regmust"); +the length of the sequence is returned in r->mustn. + +The sequences calculated for the various types of node (in pseudo ANSI c) +are shown below. "p" is the operand of unary operators (and the left-hand +operand of binary operators); "q" is the right-hand operand of binary operators +. +"ZERO" means "a zero-length sequence" below. + +Type left right is in +---- ---- ----- -- -- +char c # c # c # c # c + +SET ZERO ZERO ZERO ZERO + +STAR ZERO ZERO ZERO ZERO + +QMARK ZERO ZERO ZERO ZERO + +PLUS p->left p->right ZERO p->in + +CAT (p->is==ZERO)? (q->is==ZERO)? (p->is!=ZERO && p->in plus + p->left : q->right : q->is!=ZERO) ? q->in plus + p->is##q->left p->right##q->is p->is##q->is : p->right##q->left + ZERO + +OR longest common longest common (do p->is and substrings common to + leading trailing q->is have same p->in and q->in + (sub)sequence (sub)sequence length and + of p->left of p->right content) ? + and q->left and q->right p->is : NULL + +If there's anything else we recognize in the tree, all four sequences get set +to zero-length sequences. If there's something we don't recognize in the tree, +we just return a zero-length sequence. + +Break ties in favor of infrequent letters (choosing 'zzz' in preference to +'aaa')? + +And. . .is it here or someplace that we might ponder "optimizations" such as + egrep 'psi|epsilon' -> egrep 'psi' + egrep 'pepsi|epsilon' -> egrep 'epsi' + (Yes, we now find "epsi" as a "string + that must occur", but we might also + simplify the *entire* r.e. being sought +) + grep '[c]' -> grep 'c' + grep '(ab|a)b' -> grep 'ab' + grep 'ab*' -> grep 'a' + grep 'a*b' -> grep 'b' +There are several issues: + Is optimization easy (enough)? + + Does optimization actually accomplish anything, + or is the automaton you get from "psi|epsilon" (for example) + the same as the one you get from "psi" (for example)? + + Are optimizable r.e.'s likely to be used in real-life situations + (something like 'ab*' is probably unlikely; something like is + 'psi|epsilon' is likelier)? +*/ + +static char * +icatalloc(old, new) +char * old; +char * new; +{ + register char * result; + register int oldsize, newsize; + + newsize = (new == NULL) ? 0 : strlen(new); + if (old == NULL) + oldsize = 0; + else if (newsize == 0) + return old; + else oldsize = strlen(old); + if (old == NULL) + result = (char *) malloc(newsize + 1); + else result = (char *) realloc((void *) old, oldsize + newsize + 1); + if (result != NULL && new != NULL) + (void) strcpy(result + oldsize, new); + return result; +} + +static char * +icpyalloc(string) +const char * string; +{ + return icatalloc((char *) NULL, string); +} + +static char * +istrstr(lookin, lookfor) +char * lookin; +register char * lookfor; +{ + register char * cp; + register int len; + + len = strlen(lookfor); + for (cp = lookin; *cp != '\0'; ++cp) + if (strncmp(cp, lookfor, len) == 0) + return cp; + return NULL; +} + +static void +ifree(cp) +char * cp; +{ + if (cp != NULL) + free(cp); +} + +static void +freelist(cpp) +register char ** cpp; +{ + register int i; + + if (cpp == NULL) + return; + for (i = 0; cpp[i] != NULL; ++i) { + free(cpp[i]); + cpp[i] = NULL; + } +} + +static char ** +enlist(cpp, new, len) +register char ** cpp; +register char * new; +int len; +{ + register int i, j; + + if (cpp == NULL) + return NULL; + if ((new = icpyalloc(new)) == NULL) { + freelist(cpp); + return NULL; + } + new[len] = '\0'; + /* + ** Is there already something in the list that's new (or longer)? + */ + for (i = 0; cpp[i] != NULL; ++i) + if (istrstr(cpp[i], new) != NULL) { + free(new); + return cpp; + } + /* + ** Eliminate any obsoleted strings. + */ + j = 0; + while (cpp[j] != NULL) + if (istrstr(new, cpp[j]) == NULL) + ++j; + else { + free(cpp[j]); + if (--i == j) + break; + cpp[j] = cpp[i]; + cpp[i] = 0; + } + /* + ** Add the new string. + */ + cpp = (char **) realloc((char *) cpp, (i + 2) * sizeof *cpp); + if (cpp == NULL) + return NULL; + cpp[i] = new; + cpp[i + 1] = NULL; + return cpp; +} + +/* +** Given pointers to two strings, +** return a pointer to an allocated list of their distinct common substrings. +** Return NULL if something seems wild. +*/ + +static char ** +comsubs(left, right) +char * left; +char * right; +{ + register char ** cpp; + register char * lcp; + register char * rcp; + register int i, len; + + if (left == NULL || right == NULL) + return NULL; + cpp = (char **) malloc(sizeof *cpp); + if (cpp == NULL) + return NULL; + cpp[0] = NULL; + for (lcp = left; *lcp != '\0'; ++lcp) { + len = 0; + rcp = index(right, *lcp); + while (rcp != NULL) { + for (i = 1; lcp[i] != '\0' && lcp[i] == rcp[i]; ++i) + ; + if (i > len) + len = i; + rcp = index(rcp + 1, *lcp); + } + if (len == 0) + continue; + if ((cpp = enlist(cpp, lcp, len)) == NULL) + break; + } + return cpp; +} + +static char ** +addlists(old, new) +char ** old; +char ** new; +{ + register int i; + + if (old == NULL || new == NULL) + return NULL; + for (i = 0; new[i] != NULL; ++i) { + old = enlist(old, new[i], strlen(new[i])); + if (old == NULL) + break; + } + return old; +} + +/* +** Given two lists of substrings, +** return a new list giving substrings common to both. +*/ + +static char ** +inboth(left, right) +char ** left; +char ** right; +{ + register char ** both; + register char ** temp; + register int lnum, rnum; + + if (left == NULL || right == NULL) + return NULL; + both = (char **) malloc(sizeof *both); + if (both == NULL) + return NULL; + both[0] = NULL; + for (lnum = 0; left[lnum] != NULL; ++lnum) { + for (rnum = 0; right[rnum] != NULL; ++rnum) { + temp = comsubs(left[lnum], right[rnum]); + if (temp == NULL) { + freelist(both); + return NULL; + } + both = addlists(both, temp); + freelist(temp); + if (both == NULL) + return NULL; + } + } + return both; +} + +typedef struct { + char ** in; + char * left; + char * right; + char * is; +} must; + +static void +resetmust(mp) +register must * mp; +{ + mp->left[0] = mp->right[0] = mp->is[0] = '\0'; + freelist(mp->in); +} + +static void +regmust(reg) +register struct regexp * reg; +{ + register must * musts; + register must * mp; + register char * result; + register int ri; + register int i; + register _token t; + static must must0; + + reg->mustn = 0; + reg->must[0] = '\0'; + musts = (must *) malloc((reg->tindex + 1) * sizeof *musts); + if (musts == NULL) + return; + mp = musts; + for (i = 0; i <= reg->tindex; ++i) + mp[i] = must0; + for (i = 0; i <= reg->tindex; ++i) { + mp[i].in = (char **) malloc(sizeof *mp[i].in); + mp[i].left = malloc(2); + mp[i].right = malloc(2); + mp[i].is = malloc(2); + if (mp[i].in == NULL || mp[i].left == NULL || + mp[i].right == NULL || mp[i].is == NULL) + goto done; + mp[i].left[0] = mp[i].right[0] = mp[i].is[0] = '\0'; + mp[i].in[0] = NULL; + } + result = ""; +#ifdef DEBUG + fprintf(stderr, "regmust:\n"); + for (i = 0; i < reg->tindex; ++i) { + fprintf(stderr, " %d:", i); + prtok(reg->tokens[i]); + } + putc('\n', stderr); +#endif + for (ri = 0; ri < reg->tindex; ++ri) { + switch (t = reg->tokens[ri]) { + case _ALLBEGLINE: + case _ALLENDLINE: + case _LPAREN: + case _RPAREN: + goto done; /* "cannot happen" */ + case _EMPTY: + case _BEGLINE: + case _ENDLINE: + case _BEGWORD: + case _ENDWORD: + case _LIMWORD: + case _NOTLIMWORD: + case _BACKREF: + resetmust(mp); + break; + case _STAR: + case _QMARK: + if (mp <= musts) + goto done; /* "cannot happen" */ + --mp; + resetmust(mp); + break; + case _OR: + if (mp < &musts[2]) + goto done; /* "cannot happen" */ + { + register char ** new; + register must * lmp; + register must * rmp; + register int j, ln, rn, n; + + rmp = --mp; + lmp = --mp; + /* Guaranteed to be. Unlikely, but. . . */ + if (strcmp(lmp->is, rmp->is) != 0) + lmp->is[0] = '\0'; + /* Left side--easy */ + i = 0; + while (lmp->left[i] != '\0' && + lmp->left[i] == rmp->left[i]) + ++i; + lmp->left[i] = '\0'; + /* Right side */ + ln = strlen(lmp->right); + rn = strlen(rmp->right); + n = ln; + if (n > rn) + n = rn; + for (i = 0; i < n; ++i) + if (lmp->right[ln - i - 1] != + rmp->right[rn - i - 1]) + break; + for (j = 0; j < i; ++j) + lmp->right[j] = + lmp->right[(ln - i) + j]; + lmp->right[j] = '\0'; + new = inboth(lmp->in, rmp->in); + if (new == NULL) + goto done; + freelist(lmp->in); + free((char *) lmp->in); + lmp->in = new; + } + break; + case _PLUS: + if (mp <= musts) + goto done; /* "cannot happen" */ + --mp; + mp->is[0] = '\0'; + break; + case _END: + if (mp != &musts[1]) + goto done; /* "cannot happen" */ + for (i = 0; musts[0].in[i] != NULL; ++i) + if (strlen(musts[0].in[i]) > strlen(result)) + result = musts[0].in[i]; + goto done; + case _CAT: + if (mp < &musts[2]) + goto done; /* "cannot happen" */ + { + register must * lmp; + register must * rmp; + + rmp = --mp; + lmp = --mp; + /* + ** In. Everything in left, plus everything in + ** right, plus catenation of + ** left's right and right's left. + */ + lmp->in = addlists(lmp->in, rmp->in); + if (lmp->in == NULL) + goto done; + if (lmp->right[0] != '\0' && + rmp->left[0] != '\0') { + register char * tp; + + tp = icpyalloc(lmp->right); + if (tp == NULL) + goto done; + tp = icatalloc(tp, rmp->left); + if (tp == NULL) + goto done; + lmp->in = enlist(lmp->in, tp, + strlen(tp)); + free(tp); + if (lmp->in == NULL) + goto done; + } + /* Left-hand */ + if (lmp->is[0] != '\0') { + lmp->left = icatalloc(lmp->left, + rmp->left); + if (lmp->left == NULL) + goto done; + } + /* Right-hand */ + if (rmp->is[0] == '\0') + lmp->right[0] = '\0'; + lmp->right = icatalloc(lmp->right, rmp->right); + if (lmp->right == NULL) + goto done; + /* Guaranteed to be */ + if (lmp->is[0] != '\0' && rmp->is[0] != '\0') { + lmp->is = icatalloc(lmp->is, rmp->is); + if (lmp->is == NULL) + goto done; + } else + lmp->is[0] = '\0'; + } + break; + default: + if (t < _END) { + /* "cannot happen" */ + goto done; + } else if (t == '\0') { + /* not on *my* shift */ + goto done; + } else if (t >= _SET) { + /* easy enough */ + resetmust(mp); + } else { + /* plain character */ + resetmust(mp); + mp->is[0] = mp->left[0] = mp->right[0] = t; + mp->is[1] = mp->left[1] = mp->right[1] = '\0'; + mp->in = enlist(mp->in, mp->is, 1); + if (mp->in == NULL) + goto done; + } + break; + } +#ifdef DEBUG + fprintf(stderr, " node: %d:", ri); + prtok(reg->tokens[ri]); + fprintf(stderr, "\n in:"); + for (i = 0; mp->in[i]; ++i) + fprintf(stderr, " \"%s\"", mp->in[i]); + fprintf(stderr, "\n is: \"%s\"\n", mp->is); + fprintf(stderr, " left: \"%s\"\n", mp->left); + fprintf(stderr, " right: \"%s\"\n", mp->right); +#endif + ++mp; + } +done: + (void) strncpy(reg->must, result, MUST_MAX - 1); + reg->must[MUST_MAX - 1] = '\0'; + reg->mustn = strlen(reg->must); + mp = musts; + for (i = 0; i <= reg->tindex; ++i) { + freelist(mp[i].in); + ifree((char *) mp[i].in); + ifree(mp[i].left); + ifree(mp[i].right); + ifree(mp[i].is); + } + free((char *) mp); +} diff --git a/gnu/usr.bin/grep/dfa.h b/gnu/usr.bin/grep/dfa.h new file mode 100644 index 000000000000..33e4bf14195a --- /dev/null +++ b/gnu/usr.bin/grep/dfa.h @@ -0,0 +1,450 @@ +/* dfa.h - declarations for GNU deterministic regexp compiler + Copyright (C) 1988 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +/* Written June, 1988 by Mike Haertel */ + +#ifdef STDC_HEADERS + +#include +#include + +#else /* !STDC_HEADERS */ + +#define const +#include /* For size_t. */ +extern char *calloc(), *malloc(), *realloc(); +extern void free(); + +#ifndef NULL +#define NULL 0 +#endif + +#endif /* ! STDC_HEADERS */ + +#include +#ifndef isascii +#define ISALNUM(c) isalnum(c) +#define ISALPHA(c) isalpha(c) +#define ISUPPER(c) isupper(c) +#define ISLOWER(c) islower(c) +#else +#define ISALNUM(c) (isascii(c) && isalnum(c)) +#define ISALPHA(c) (isascii(c) && isalpha(c)) +#define ISUPPER(c) (isascii(c) && isupper(c)) +#define ISLOWER(c) (isascii(c) && islower(c)) +#endif + +/* 1 means plain parentheses serve as grouping, and backslash + parentheses are needed for literal searching. + 0 means backslash-parentheses are grouping, and plain parentheses + are for literal searching. */ +#define RE_NO_BK_PARENS 1 + +/* 1 means plain | serves as the "or"-operator, and \| is a literal. + 0 means \| serves as the "or"-operator, and | is a literal. */ +#define RE_NO_BK_VBAR 2 + +/* 0 means plain + or ? serves as an operator, and \+, \? are literals. + 1 means \+, \? are operators and plain +, ? are literals. */ +#define RE_BK_PLUS_QM 4 + +/* 1 means | binds tighter than ^ or $. + 0 means the contrary. */ +#define RE_TIGHT_VBAR 8 + +/* 1 means treat \n as an _OR operator + 0 means treat it as a normal character */ +#define RE_NEWLINE_OR 16 + +/* 0 means that a special characters (such as *, ^, and $) always have + their special meaning regardless of the surrounding context. + 1 means that special characters may act as normal characters in some + contexts. Specifically, this applies to: + ^ - only special at the beginning, or after ( or | + $ - only special at the end, or before ) or | + *, +, ? - only special when not after the beginning, (, or | */ +#define RE_CONTEXT_INDEP_OPS 32 + +/* Now define combinations of bits for the standard possibilities. */ +#define RE_SYNTAX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS) +#define RE_SYNTAX_EGREP (RE_SYNTAX_AWK | RE_NEWLINE_OR) +#define RE_SYNTAX_GREP (RE_BK_PLUS_QM | RE_NEWLINE_OR) +#define RE_SYNTAX_EMACS 0 + +/* Number of bits in an unsigned char. */ +#define CHARBITS 8 + +/* First integer value that is greater than any character code. */ +#define _NOTCHAR (1 << CHARBITS) + +/* INTBITS need not be exact, just a lower bound. */ +#define INTBITS (CHARBITS * sizeof (int)) + +/* Number of ints required to hold a bit for every character. */ +#define _CHARSET_INTS ((_NOTCHAR + INTBITS - 1) / INTBITS) + +/* Sets of unsigned characters are stored as bit vectors in arrays of ints. */ +typedef int _charset[_CHARSET_INTS]; + +/* The regexp is parsed into an array of tokens in postfix form. Some tokens + are operators and others are terminal symbols. Most (but not all) of these + codes are returned by the lexical analyzer. */ +#if __STDC__ + +typedef enum +{ + _END = -1, /* _END is a terminal symbol that matches the + end of input; any value of _END or less in + the parse tree is such a symbol. Accepting + states of the DFA are those that would have + a transition on _END. */ + + /* Ordinary character values are terminal symbols that match themselves. */ + + _EMPTY = _NOTCHAR, /* _EMPTY is a terminal symbol that matches + the empty string. */ + + _BACKREF, /* _BACKREF is generated by \; it + it not completely handled. If the scanner + detects a transition on backref, it returns + a kind of "semi-success" indicating that + the match will have to be verified with + a backtracking matcher. */ + + _BEGLINE, /* _BEGLINE is a terminal symbol that matches + the empty string if it is at the beginning + of a line. */ + + _ALLBEGLINE, /* _ALLBEGLINE is a terminal symbol that + matches the empty string if it is at the + beginning of a line; _ALLBEGLINE applies + to the entire regexp and can only occur + as the first token thereof. _ALLBEGLINE + never appears in the parse tree; a _BEGLINE + is prepended with _CAT to the entire + regexp instead. */ + + _ENDLINE, /* _ENDLINE is a terminal symbol that matches + the empty string if it is at the end of + a line. */ + + _ALLENDLINE, /* _ALLENDLINE is to _ENDLINE as _ALLBEGLINE + is to _BEGLINE. */ + + _BEGWORD, /* _BEGWORD is a terminal symbol that matches + the empty string if it is at the beginning + of a word. */ + + _ENDWORD, /* _ENDWORD is a terminal symbol that matches + the empty string if it is at the end of + a word. */ + + _LIMWORD, /* _LIMWORD is a terminal symbol that matches + the empty string if it is at the beginning + or the end of a word. */ + + _NOTLIMWORD, /* _NOTLIMWORD is a terminal symbol that + matches the empty string if it is not at + the beginning or end of a word. */ + + _QMARK, /* _QMARK is an operator of one argument that + matches zero or one occurences of its + argument. */ + + _STAR, /* _STAR is an operator of one argument that + matches the Kleene closure (zero or more + occurrences) of its argument. */ + + _PLUS, /* _PLUS is an operator of one argument that + matches the positive closure (one or more + occurrences) of its argument. */ + + _CAT, /* _CAT is an operator of two arguments that + matches the concatenation of its + arguments. _CAT is never returned by the + lexical analyzer. */ + + _OR, /* _OR is an operator of two arguments that + matches either of its arguments. */ + + _LPAREN, /* _LPAREN never appears in the parse tree, + it is only a lexeme. */ + + _RPAREN, /* _RPAREN never appears in the parse tree. */ + + _SET /* _SET and (and any value greater) is a + terminal symbol that matches any of a + class of characters. */ +} _token; + +#else /* ! __STDC__ */ + +typedef short _token; + +#define _END -1 +#define _EMPTY _NOTCHAR +#define _BACKREF (_EMPTY + 1) +#define _BEGLINE (_EMPTY + 2) +#define _ALLBEGLINE (_EMPTY + 3) +#define _ENDLINE (_EMPTY + 4) +#define _ALLENDLINE (_EMPTY + 5) +#define _BEGWORD (_EMPTY + 6) +#define _ENDWORD (_EMPTY + 7) +#define _LIMWORD (_EMPTY + 8) +#define _NOTLIMWORD (_EMPTY + 9) +#define _QMARK (_EMPTY + 10) +#define _STAR (_EMPTY + 11) +#define _PLUS (_EMPTY + 12) +#define _CAT (_EMPTY + 13) +#define _OR (_EMPTY + 14) +#define _LPAREN (_EMPTY + 15) +#define _RPAREN (_EMPTY + 16) +#define _SET (_EMPTY + 17) + +#endif /* ! __STDC__ */ + +/* Sets are stored in an array in the compiled regexp; the index of the + array corresponding to a given set token is given by _SET_INDEX(t). */ +#define _SET_INDEX(t) ((t) - _SET) + +/* Sometimes characters can only be matched depending on the surrounding + context. Such context decisions depend on what the previous character + was, and the value of the current (lookahead) character. Context + dependent constraints are encoded as 8 bit integers. Each bit that + is set indicates that the constraint succeeds in the corresponding + context. + + bit 7 - previous and current are newlines + bit 6 - previous was newline, current isn't + bit 5 - previous wasn't newline, current is + bit 4 - neither previous nor current is a newline + bit 3 - previous and current are word-constituents + bit 2 - previous was word-constituent, current isn't + bit 1 - previous wasn't word-constituent, current is + bit 0 - neither previous nor current is word-constituent + + Word-constituent characters are those that satisfy isalnum(). + + The macro _SUCCEEDS_IN_CONTEXT determines whether a a given constraint + succeeds in a particular context. Prevn is true if the previous character + was a newline, currn is true if the lookahead character is a newline. + Prevl and currl similarly depend upon whether the previous and current + characters are word-constituent letters. */ +#define _MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \ + ((constraint) & 1 << ((prevn) ? 2 : 0) + ((currn) ? 1 : 0) + 4) +#define _MATCHES_LETTER_CONTEXT(constraint, prevl, currl) \ + ((constraint) & 1 << ((prevl) ? 2 : 0) + ((currl) ? 1 : 0)) +#define _SUCCEEDS_IN_CONTEXT(constraint, prevn, currn, prevl, currl) \ + (_MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \ + && _MATCHES_LETTER_CONTEXT(constraint, prevl, currl)) + +/* The following macros give information about what a constraint depends on. */ +#define _PREV_NEWLINE_DEPENDENT(constraint) \ + (((constraint) & 0xc0) >> 2 != ((constraint) & 0x30)) +#define _PREV_LETTER_DEPENDENT(constraint) \ + (((constraint) & 0x0c) >> 2 != ((constraint) & 0x03)) + +/* Tokens that match the empty string subject to some constraint actually + work by applying that constraint to determine what may follow them, + taking into account what has gone before. The following values are + the constraints corresponding to the special tokens previously defined. */ +#define _NO_CONSTRAINT 0xff +#define _BEGLINE_CONSTRAINT 0xcf +#define _ENDLINE_CONSTRAINT 0xaf +#define _BEGWORD_CONSTRAINT 0xf2 +#define _ENDWORD_CONSTRAINT 0xf4 +#define _LIMWORD_CONSTRAINT 0xf6 +#define _NOTLIMWORD_CONSTRAINT 0xf9 + +/* States of the recognizer correspond to sets of positions in the parse + tree, together with the constraints under which they may be matched. + So a position is encoded as an index into the parse tree together with + a constraint. */ +typedef struct +{ + unsigned index; /* Index into the parse array. */ + unsigned constraint; /* Constraint for matching this position. */ +} _position; + +/* Sets of positions are stored as arrays. */ +typedef struct +{ + _position *elems; /* Elements of this position set. */ + int nelem; /* Number of elements in this set. */ +} _position_set; + +/* A state of the regexp consists of a set of positions, some flags, + and the token value of the lowest-numbered position of the state that + contains an _END token. */ +typedef struct +{ + int hash; /* Hash of the positions of this state. */ + _position_set elems; /* Positions this state could match. */ + char newline; /* True if previous state matched newline. */ + char letter; /* True if previous state matched a letter. */ + char backref; /* True if this state matches a \. */ + unsigned char constraint; /* Constraint for this state to accept. */ + int first_end; /* Token value of the first _END in elems. */ +} _dfa_state; + +/* If an r.e. is at most MUST_MAX characters long, we look for a string which + must appear in it; whatever's found is dropped into the struct reg. */ + +#define MUST_MAX 50 + +/* A compiled regular expression. */ +struct regexp +{ + /* Stuff built by the scanner. */ + _charset *charsets; /* Array of character sets for _SET tokens. */ + int cindex; /* Index for adding new charsets. */ + int calloc; /* Number of charsets currently allocated. */ + + /* Stuff built by the parser. */ + _token *tokens; /* Postfix parse array. */ + int tindex; /* Index for adding new tokens. */ + int talloc; /* Number of tokens currently allocated. */ + int depth; /* Depth required of an evaluation stack + used for depth-first traversal of the + parse tree. */ + int nleaves; /* Number of leaves on the parse tree. */ + int nregexps; /* Count of parallel regexps being built + with regparse(). */ + + /* Stuff owned by the state builder. */ + _dfa_state *states; /* States of the regexp. */ + int sindex; /* Index for adding new states. */ + int salloc; /* Number of states currently allocated. */ + + /* Stuff built by the structure analyzer. */ + _position_set *follows; /* Array of follow sets, indexed by position + index. The follow of a position is the set + of positions containing characters that + could conceivably follow a character + matching the given position in a string + matching the regexp. Allocated to the + maximum possible position index. */ + int searchflag; /* True if we are supposed to build a searching + as opposed to an exact matcher. A searching + matcher finds the first and shortest string + matching a regexp anywhere in the buffer, + whereas an exact matcher finds the longest + string matching, but anchored to the + beginning of the buffer. */ + + /* Stuff owned by the executor. */ + int tralloc; /* Number of transition tables that have + slots so far. */ + int trcount; /* Number of transition tables that have + actually been built. */ + int **trans; /* Transition tables for states that can + never accept. If the transitions for a + state have not yet been computed, or the + state could possibly accept, its entry in + this table is NULL. */ + int **realtrans; /* Trans always points to realtrans + 1; this + is so trans[-1] can contain NULL. */ + int **fails; /* Transition tables after failing to accept + on a state that potentially could do so. */ + int *success; /* Table of acceptance conditions used in + regexecute and computed in build_state. */ + int *newlines; /* Transitions on newlines. The entry for a + newline in any transition table is always + -1 so we can count lines without wasting + too many cycles. The transition for a + newline is stored separately and handled + as a special case. Newline is also used + as a sentinel at the end of the buffer. */ + char must[MUST_MAX]; + int mustn; +}; + +/* Some macros for user access to regexp internals. */ + +/* ACCEPTING returns true if s could possibly be an accepting state of r. */ +#define ACCEPTING(s, r) ((r).states[s].constraint) + +/* ACCEPTS_IN_CONTEXT returns true if the given state accepts in the + specified context. */ +#define ACCEPTS_IN_CONTEXT(prevn, currn, prevl, currl, state, reg) \ + _SUCCEEDS_IN_CONTEXT((reg).states[state].constraint, \ + prevn, currn, prevl, currl) + +/* FIRST_MATCHING_REGEXP returns the index number of the first of parallel + regexps that a given state could accept. Parallel regexps are numbered + starting at 1. */ +#define FIRST_MATCHING_REGEXP(state, reg) (-(reg).states[state].first_end) + +/* Entry points. */ + +#if __STDC__ + +/* Regsyntax() takes two arguments; the first sets the syntax bits described + earlier in this file, and the second sets the case-folding flag. */ +extern void regsyntax(int, int); + +/* Compile the given string of the given length into the given struct regexp. + Final argument is a flag specifying whether to build a searching or an + exact matcher. */ +extern void regcompile(const char *, size_t, struct regexp *, int); + +/* Execute the given struct regexp on the buffer of characters. The + first char * points to the beginning, and the second points to the + first character after the end of the buffer, which must be a writable + place so a sentinel end-of-buffer marker can be stored there. The + second-to-last argument is a flag telling whether to allow newlines to + be part of a string matching the regexp. The next-to-last argument, + if non-NULL, points to a place to increment every time we see a + newline. The final argument, if non-NULL, points to a flag that will + be set if further examination by a backtracking matcher is needed in + order to verify backreferencing; otherwise the flag will be cleared. + Returns NULL if no match is found, or a pointer to the first + character after the first & shortest matching string in the buffer. */ +extern char *regexecute(struct regexp *, char *, char *, int, int *, int *); + +/* Free the storage held by the components of a struct regexp. */ +extern void regfree(struct regexp *); + +/* Entry points for people who know what they're doing. */ + +/* Initialize the components of a struct regexp. */ +extern void reginit(struct regexp *); + +/* Incrementally parse a string of given length into a struct regexp. */ +extern void regparse(const char *, size_t, struct regexp *); + +/* Analyze a parsed regexp; second argument tells whether to build a searching + or an exact matcher. */ +extern void reganalyze(struct regexp *, int); + +/* Compute, for each possible character, the transitions out of a given + state, storing them in an array of integers. */ +extern void regstate(int, struct regexp *, int []); + +/* Error handling. */ + +/* Regerror() is called by the regexp routines whenever an error occurs. It + takes a single argument, a NUL-terminated string describing the error. + The default regerror() prints the error message to stderr and exits. + The user can provide a different regfree() if so desired. */ +extern void regerror(const char *); + +#else /* ! __STDC__ */ +extern void regsyntax(), regcompile(), regfree(), reginit(), regparse(); +extern void reganalyze(), regstate(), regerror(); +extern char *regexecute(); +#endif /* ! __STDC__ */ diff --git a/gnu/usr.bin/grep/getopt.c b/gnu/usr.bin/grep/getopt.c new file mode 100644 index 000000000000..0661bdf76289 --- /dev/null +++ b/gnu/usr.bin/grep/getopt.c @@ -0,0 +1,678 @@ +/* Getopt for GNU. + NOTE: getopt is now part of the C library, so if you don't know what + "Keep this file name-space clean" means, talk to roland@gnu.ai.mit.edu + before changing it! + + Copyright (C) 1987, 88, 89, 90, 91, 1992 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +/* AIX requires this to be the first thing in the file. */ +#ifdef __GNUC__ +#define alloca __builtin_alloca +#else /* not __GNUC__ */ +#if defined(sparc) && !defined(USG) && !defined(SVR4) && !defined(__svr4__) +#include +#else +#ifdef _AIX + #pragma alloca +#else +char *alloca (); +#endif +#endif /* sparc */ +#endif /* not __GNUC__ */ + +#ifdef LIBC +/* For when compiled as part of the GNU C library. */ +#include +#endif + +#include + +/* This needs to come after some library #include + to get __GNU_LIBRARY__ defined. */ +#ifdef __GNU_LIBRARY__ +#undef alloca +#include +#include +#else /* Not GNU C library. */ +#define __alloca alloca +#endif /* GNU C library. */ + + +#ifndef __STDC__ +#define const +#endif + +/* If GETOPT_COMPAT is defined, `+' as well as `--' can introduce a + long-named option. Because this is not POSIX.2 compliant, it is + being phased out. */ +#define GETOPT_COMPAT + +/* This version of `getopt' appears to the caller like standard Unix `getopt' + but it behaves differently for the user, since it allows the user + to intersperse the options with the other arguments. + + As `getopt' works, it permutes the elements of ARGV so that, + when it is done, all the options precede everything else. Thus + all application programs are extended to handle flexible argument order. + + Setting the environment variable POSIXLY_CORRECT disables permutation. + Then the behavior is completely standard. + + GNU application programs can use a third alternative mode in which + they can distinguish the relative order of options and other arguments. */ + +#include "getopt.h" + +/* For communication from `getopt' to the caller. + When `getopt' finds an option that takes an argument, + the argument value is returned here. + Also, when `ordering' is RETURN_IN_ORDER, + each non-option ARGV-element is returned here. */ + +char *optarg = 0; + +/* Index in ARGV of the next element to be scanned. + This is used for communication to and from the caller + and for communication between successive calls to `getopt'. + + On entry to `getopt', zero means this is the first call; initialize. + + When `getopt' returns EOF, this is the index of the first of the + non-option elements that the caller should itself scan. + + Otherwise, `optind' communicates from one call to the next + how much of ARGV has been scanned so far. */ + +int optind = 0; + +/* The next char to be scanned in the option-element + in which the last option character we returned was found. + This allows us to pick up the scan where we left off. + + If this is zero, or a null string, it means resume the scan + by advancing to the next ARGV-element. */ + +static char *nextchar; + +/* Callers store zero here to inhibit the error message + for unrecognized options. */ + +int opterr = 1; + +/* Describe how to deal with options that follow non-option ARGV-elements. + + If the caller did not specify anything, + the default is REQUIRE_ORDER if the environment variable + POSIXLY_CORRECT is defined, PERMUTE otherwise. + + REQUIRE_ORDER means don't recognize them as options; + stop option processing when the first non-option is seen. + This is what Unix does. + This mode of operation is selected by either setting the environment + variable POSIXLY_CORRECT, or using `+' as the first character + of the list of option characters. + + PERMUTE is the default. We permute the contents of ARGV as we scan, + so that eventually all the non-options are at the end. This allows options + to be given in any order, even with programs that were not written to + expect this. + + RETURN_IN_ORDER is an option available to programs that were written + to expect options and other ARGV-elements in any order and that care about + the ordering of the two. We describe each non-option ARGV-element + as if it were the argument of an option with character code 1. + Using `-' as the first character of the list of option characters + selects this mode of operation. + + The special argument `--' forces an end of option-scanning regardless + of the value of `ordering'. In the case of RETURN_IN_ORDER, only + `--' can cause `getopt' to return EOF with `optind' != ARGC. */ + +static enum +{ + REQUIRE_ORDER, PERMUTE, RETURN_IN_ORDER +} ordering; + +#ifdef __GNU_LIBRARY__ +#include +#define my_index strchr +#define my_bcopy(src, dst, n) memcpy ((dst), (src), (n)) +#else + +/* Avoid depending on library functions or files + whose names are inconsistent. */ + +char *getenv (); + +static char * +my_index (string, chr) + char *string; + int chr; +{ + while (*string) + { + if (*string == chr) + return string; + string++; + } + return 0; +} + +static void +my_bcopy (from, to, size) + char *from, *to; + int size; +{ + int i; + for (i = 0; i < size; i++) + to[i] = from[i]; +} +#endif /* GNU C library. */ + +/* Handle permutation of arguments. */ + +/* Describe the part of ARGV that contains non-options that have + been skipped. `first_nonopt' is the index in ARGV of the first of them; + `last_nonopt' is the index after the last of them. */ + +static int first_nonopt; +static int last_nonopt; + +/* Exchange two adjacent subsequences of ARGV. + One subsequence is elements [first_nonopt,last_nonopt) + which contains all the non-options that have been skipped so far. + The other is elements [last_nonopt,optind), which contains all + the options processed since those non-options were skipped. + + `first_nonopt' and `last_nonopt' are relocated so that they describe + the new indices of the non-options in ARGV after they are moved. */ + +static void +exchange (argv) + char **argv; +{ + int nonopts_size = (last_nonopt - first_nonopt) * sizeof (char *); + char **temp = (char **) __alloca (nonopts_size); + + /* Interchange the two blocks of data in ARGV. */ + + my_bcopy (&argv[first_nonopt], temp, nonopts_size); + my_bcopy (&argv[last_nonopt], &argv[first_nonopt], + (optind - last_nonopt) * sizeof (char *)); + my_bcopy (temp, &argv[first_nonopt + optind - last_nonopt], nonopts_size); + + /* Update records for the slots the non-options now occupy. */ + + first_nonopt += (optind - last_nonopt); + last_nonopt = optind; +} + +/* Scan elements of ARGV (whose length is ARGC) for option characters + given in OPTSTRING. + + If an element of ARGV starts with '-', and is not exactly "-" or "--", + then it is an option element. The characters of this element + (aside from the initial '-') are option characters. If `getopt' + is called repeatedly, it returns successively each of the option characters + from each of the option elements. + + If `getopt' finds another option character, it returns that character, + updating `optind' and `nextchar' so that the next call to `getopt' can + resume the scan with the following option character or ARGV-element. + + If there are no more option characters, `getopt' returns `EOF'. + Then `optind' is the index in ARGV of the first ARGV-element + that is not an option. (The ARGV-elements have been permuted + so that those that are not options now come last.) + + OPTSTRING is a string containing the legitimate option characters. + If an option character is seen that is not listed in OPTSTRING, + return '?' after printing an error message. If you set `opterr' to + zero, the error message is suppressed but we still return '?'. + + If a char in OPTSTRING is followed by a colon, that means it wants an arg, + so the following text in the same ARGV-element, or the text of the following + ARGV-element, is returned in `optarg'. Two colons mean an option that + wants an optional arg; if there is text in the current ARGV-element, + it is returned in `optarg', otherwise `optarg' is set to zero. + + If OPTSTRING starts with `-' or `+', it requests different methods of + handling the non-option ARGV-elements. + See the comments about RETURN_IN_ORDER and REQUIRE_ORDER, above. + + Long-named options begin with `--' instead of `-'. + Their names may be abbreviated as long as the abbreviation is unique + or is an exact match for some defined option. If they have an + argument, it follows the option name in the same ARGV-element, separated + from the option name by a `=', or else the in next ARGV-element. + When `getopt' finds a long-named option, it returns 0 if that option's + `flag' field is nonzero, the value of the option's `val' field + if the `flag' field is zero. + + The elements of ARGV aren't really const, because we permute them. + But we pretend they're const in the prototype to be compatible + with other systems. + + LONGOPTS is a vector of `struct option' terminated by an + element containing a name which is zero. + + LONGIND returns the index in LONGOPT of the long-named option found. + It is only valid when a long-named option has been found by the most + recent call. + + If LONG_ONLY is nonzero, '-' as well as '--' can introduce + long-named options. */ + +int +_getopt_internal (argc, argv, optstring, longopts, longind, long_only) + int argc; + char *const *argv; + const char *optstring; + const struct option *longopts; + int *longind; + int long_only; +{ + int option_index; + + optarg = 0; + + /* Initialize the internal data when the first call is made. + Start processing options with ARGV-element 1 (since ARGV-element 0 + is the program name); the sequence of previously skipped + non-option ARGV-elements is empty. */ + + if (optind == 0) + { + first_nonopt = last_nonopt = optind = 1; + + nextchar = NULL; + + /* Determine how to handle the ordering of options and nonoptions. */ + + if (optstring[0] == '-') + { + ordering = RETURN_IN_ORDER; + ++optstring; + } + else if (optstring[0] == '+') + { + ordering = REQUIRE_ORDER; + ++optstring; + } + else if (getenv ("POSIXLY_CORRECT") != NULL) + ordering = REQUIRE_ORDER; + else + ordering = PERMUTE; + } + + if (nextchar == NULL || *nextchar == '\0') + { + if (ordering == PERMUTE) + { + /* If we have just processed some options following some non-options, + exchange them so that the options come first. */ + + if (first_nonopt != last_nonopt && last_nonopt != optind) + exchange ((char **) argv); + else if (last_nonopt != optind) + first_nonopt = optind; + + /* Now skip any additional non-options + and extend the range of non-options previously skipped. */ + + while (optind < argc + && (argv[optind][0] != '-' || argv[optind][1] == '\0') +#ifdef GETOPT_COMPAT + && (longopts == NULL + || argv[optind][0] != '+' || argv[optind][1] == '\0') +#endif /* GETOPT_COMPAT */ + ) + optind++; + last_nonopt = optind; + } + + /* Special ARGV-element `--' means premature end of options. + Skip it like a null option, + then exchange with previous non-options as if it were an option, + then skip everything else like a non-option. */ + + if (optind != argc && !strcmp (argv[optind], "--")) + { + optind++; + + if (first_nonopt != last_nonopt && last_nonopt != optind) + exchange ((char **) argv); + else if (first_nonopt == last_nonopt) + first_nonopt = optind; + last_nonopt = argc; + + optind = argc; + } + + /* If we have done all the ARGV-elements, stop the scan + and back over any non-options that we skipped and permuted. */ + + if (optind == argc) + { + /* Set the next-arg-index to point at the non-options + that we previously skipped, so the caller will digest them. */ + if (first_nonopt != last_nonopt) + optind = first_nonopt; + return EOF; + } + + /* If we have come to a non-option and did not permute it, + either stop the scan or describe it to the caller and pass it by. */ + + if ((argv[optind][0] != '-' || argv[optind][1] == '\0') +#ifdef GETOPT_COMPAT + && (longopts == NULL + || argv[optind][0] != '+' || argv[optind][1] == '\0') +#endif /* GETOPT_COMPAT */ + ) + { + if (ordering == REQUIRE_ORDER) + return EOF; + optarg = argv[optind++]; + return 1; + } + + /* We have found another option-ARGV-element. + Start decoding its characters. */ + + nextchar = (argv[optind] + 1 + + (longopts != NULL && argv[optind][1] == '-')); + } + + if (longopts != NULL + && ((argv[optind][0] == '-' + && (argv[optind][1] == '-' || long_only)) +#ifdef GETOPT_COMPAT + || argv[optind][0] == '+' +#endif /* GETOPT_COMPAT */ + )) + { + const struct option *p; + char *s = nextchar; + int exact = 0; + int ambig = 0; + const struct option *pfound = NULL; + int indfound; + + while (*s && *s != '=') + s++; + + /* Test all options for either exact match or abbreviated matches. */ + for (p = longopts, option_index = 0; p->name; + p++, option_index++) + if (!strncmp (p->name, nextchar, s - nextchar)) + { + if (s - nextchar == strlen (p->name)) + { + /* Exact match found. */ + pfound = p; + indfound = option_index; + exact = 1; + break; + } + else if (pfound == NULL) + { + /* First nonexact match found. */ + pfound = p; + indfound = option_index; + } + else + /* Second nonexact match found. */ + ambig = 1; + } + + if (ambig && !exact) + { + if (opterr) + fprintf (stderr, "%s: option `%s' is ambiguous\n", + argv[0], argv[optind]); + nextchar += strlen (nextchar); + optind++; + return '?'; + } + + if (pfound != NULL) + { + option_index = indfound; + optind++; + if (*s) + { + /* Don't test has_arg with >, because some C compilers don't + allow it to be used on enums. */ + if (pfound->has_arg) + optarg = s + 1; + else + { + if (opterr) + { + if (argv[optind - 1][1] == '-') + /* --option */ + fprintf (stderr, + "%s: option `--%s' doesn't allow an argument\n", + argv[0], pfound->name); + else + /* +option or -option */ + fprintf (stderr, + "%s: option `%c%s' doesn't allow an argument\n", + argv[0], argv[optind - 1][0], pfound->name); + } + nextchar += strlen (nextchar); + return '?'; + } + } + else if (pfound->has_arg == 1) + { + if (optind < argc) + optarg = argv[optind++]; + else + { + if (opterr) + fprintf (stderr, "%s: option `%s' requires an argument\n", + argv[0], argv[optind - 1]); + nextchar += strlen (nextchar); + return '?'; + } + } + nextchar += strlen (nextchar); + if (longind != NULL) + *longind = option_index; + if (pfound->flag) + { + *(pfound->flag) = pfound->val; + return 0; + } + return pfound->val; + } + /* Can't find it as a long option. If this is not getopt_long_only, + or the option starts with '--' or is not a valid short + option, then it's an error. + Otherwise interpret it as a short option. */ + if (!long_only || argv[optind][1] == '-' +#ifdef GETOPT_COMPAT + || argv[optind][0] == '+' +#endif /* GETOPT_COMPAT */ + || my_index (optstring, *nextchar) == NULL) + { + if (opterr) + { + if (argv[optind][1] == '-') + /* --option */ + fprintf (stderr, "%s: unrecognized option `--%s'\n", + argv[0], nextchar); + else + /* +option or -option */ + fprintf (stderr, "%s: unrecognized option `%c%s'\n", + argv[0], argv[optind][0], nextchar); + } + nextchar += strlen (nextchar); + optind++; + return '?'; + } + } + + /* Look at and handle the next option-character. */ + + { + char c = *nextchar++; + char *temp = my_index (optstring, c); + + /* Increment `optind' when we start to process its last character. */ + if (*nextchar == '\0') + optind++; + + if (temp == NULL || c == ':') + { + if (opterr) + { + if (c < 040 || c >= 0177) + fprintf (stderr, "%s: unrecognized option, character code 0%o\n", + argv[0], c); + else + fprintf (stderr, "%s: unrecognized option `-%c'\n", argv[0], c); + } + return '?'; + } + if (temp[1] == ':') + { + if (temp[2] == ':') + { + /* This is an option that accepts an argument optionally. */ + if (*nextchar != '\0') + { + optarg = nextchar; + optind++; + } + else + optarg = 0; + nextchar = NULL; + } + else + { + /* This is an option that requires an argument. */ + if (*nextchar != 0) + { + optarg = nextchar; + /* If we end this ARGV-element by taking the rest as an arg, + we must advance to the next element now. */ + optind++; + } + else if (optind == argc) + { + if (opterr) + fprintf (stderr, "%s: option `-%c' requires an argument\n", + argv[0], c); + c = '?'; + } + else + /* We already incremented `optind' once; + increment it again when taking next ARGV-elt as argument. */ + optarg = argv[optind++]; + nextchar = NULL; + } + } + return c; + } +} + +int +getopt (argc, argv, optstring) + int argc; + char *const *argv; + const char *optstring; +{ + return _getopt_internal (argc, argv, optstring, + (const struct option *) 0, + (int *) 0, + 0); +} + +#ifdef TEST + +/* Compile with -DTEST to make an executable for use in testing + the above definition of `getopt'. */ + +int +main (argc, argv) + int argc; + char **argv; +{ + int c; + int digit_optind = 0; + + while (1) + { + int this_option_optind = optind ? optind : 1; + + c = getopt (argc, argv, "abc:d:0123456789"); + if (c == EOF) + break; + + switch (c) + { + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + if (digit_optind != 0 && digit_optind != this_option_optind) + printf ("digits occur in two different argv-elements.\n"); + digit_optind = this_option_optind; + printf ("option %c\n", c); + break; + + case 'a': + printf ("option a\n"); + break; + + case 'b': + printf ("option b\n"); + break; + + case 'c': + printf ("option c with value `%s'\n", optarg); + break; + + case '?': + break; + + default: + printf ("?? getopt returned character code 0%o ??\n", c); + } + } + + if (optind < argc) + { + printf ("non-option ARGV-elements: "); + while (optind < argc) + printf ("%s ", argv[optind++]); + printf ("\n"); + } + + exit (0); +} + +#endif /* TEST */ diff --git a/gnu/usr.bin/grep/getopt.h b/gnu/usr.bin/grep/getopt.h new file mode 100644 index 000000000000..f64de3116163 --- /dev/null +++ b/gnu/usr.bin/grep/getopt.h @@ -0,0 +1,113 @@ +/* Declarations for getopt. + Copyright (C) 1989, 1990, 1991, 1992 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +#ifndef _GETOPT_H +#define _GETOPT_H 1 + +/* For communication from `getopt' to the caller. + When `getopt' finds an option that takes an argument, + the argument value is returned here. + Also, when `ordering' is RETURN_IN_ORDER, + each non-option ARGV-element is returned here. */ + +extern char *optarg; + +/* Index in ARGV of the next element to be scanned. + This is used for communication to and from the caller + and for communication between successive calls to `getopt'. + + On entry to `getopt', zero means this is the first call; initialize. + + When `getopt' returns EOF, this is the index of the first of the + non-option elements that the caller should itself scan. + + Otherwise, `optind' communicates from one call to the next + how much of ARGV has been scanned so far. */ + +extern int optind; + +/* Callers store zero here to inhibit the error message `getopt' prints + for unrecognized options. */ + +extern int opterr; + +/* Describe the long-named options requested by the application. + The LONG_OPTIONS argument to getopt_long or getopt_long_only is a vector + of `struct option' terminated by an element containing a name which is + zero. + + The field `has_arg' is: + no_argument (or 0) if the option does not take an argument, + required_argument (or 1) if the option requires an argument, + optional_argument (or 2) if the option takes an optional argument. + + If the field `flag' is not NULL, it points to a variable that is set + to the value given in the field `val' when the option is found, but + left unchanged if the option is not found. + + To have a long-named option do something other than set an `int' to + a compiled-in constant, such as set a value from `optarg', set the + option's `flag' field to zero and its `val' field to a nonzero + value (the equivalent single-letter option character, if there is + one). For long options that have a zero `flag' field, `getopt' + returns the contents of the `val' field. */ + +struct option +{ +#if __STDC__ + const char *name; +#else + char *name; +#endif + /* has_arg can't be an enum because some compilers complain about + type mismatches in all the code that assumes it is an int. */ + int has_arg; + int *flag; + int val; +}; + +/* Names for the values of the `has_arg' field of `struct option'. */ + +enum _argtype +{ + no_argument, + required_argument, + optional_argument +}; + +#if __STDC__ +extern int getopt (int argc, char *const *argv, const char *shortopts); +extern int getopt_long (int argc, char *const *argv, const char *shortopts, + const struct option *longopts, int *longind); +extern int getopt_long_only (int argc, char *const *argv, + const char *shortopts, + const struct option *longopts, int *longind); + +/* Internal only. Users should not call this directly. */ +extern int _getopt_internal (int argc, char *const *argv, + const char *shortopts, + const struct option *longopts, int *longind, + int long_only); +#else /* not __STDC__ */ +extern int getopt (); +extern int getopt_long (); +extern int getopt_long_only (); + +extern int _getopt_internal (); +#endif /* not __STDC__ */ + +#endif /* _GETOPT_H */ diff --git a/gnu/usr.bin/grep/grep.1 b/gnu/usr.bin/grep/grep.1 new file mode 100644 index 000000000000..57093b6403a3 --- /dev/null +++ b/gnu/usr.bin/grep/grep.1 @@ -0,0 +1,234 @@ +.TH GREP 1 "1988 December 13" "GNU Project" \" -*- nroff -*- +.UC 4 +.SH NAME +grep, egrep \- print lines matching a regular expression +.SH SYNOPSIS +.B grep +[ +.B \-CVbchilnsvwx +] [ +.BI \- num +] [ +.B \-AB +.I num +] [ [ +.B \-e +] +.I expr +| +.B \-f +.I file +] [ +.I "files ..." +] +.SH DESCRIPTION +.I Grep +searches the files listed in the arguments (or standard +input if no files are given) for all lines that contain a match for +the given +.IR expr . +If any lines match, they are printed. +.PP +Also, if any matches were found, +.I grep +exits with a status of 0, but if no matches were found it exits +with a status of 1. This is useful for building shell scripts that +use +.I grep +as a condition for, for example, the +.I if +statement. +.PP +When invoked as +.I egrep +the syntax of the +.I expr +is slightly different; See below. +.br +.SH "REGULAR EXPRESSIONS" +.RS 2.5i +.ta 1i 2i +.sp +.ti -2.0i +(grep) (egrep) (explanation) +.sp +.ti -2.0i +\fIc\fP \fIc\fP a single (non-meta) character matches itself. +.sp +.ti -2.0i +\&. . matches any single character except newline. +.sp +.ti -2.0i +\\? ? postfix operator; preceeding item is optional. +.sp +.ti -2.0i +\(** \(** postfix operator; preceeding item 0 or +more times. +.sp +.ti -2.0i +\\+ + postfix operator; preceeding item 1 or +more times. +.sp +.ti -2.0i +\\| | infix operator; matches either +argument. +.sp +.ti -2.0i +^ ^ matches the empty string at the beginning of a line. +.sp +.ti -2.0i +$ $ matches the empty string at the end of a line. +.sp +.ti -2.0i +\\< \\< matches the empty string at the beginning of a word. +.sp +.ti -2.0i +\\> \\> matches the empty string at the end of a word. +.sp +.ti -2.0i +[\fIchars\fP] [\fIchars\fP] match any character in the given class; if the +first character after [ is ^, match any character +not in the given class; a range of characters may +be specified by \fIfirst\-last\fP; for example, \\W +(below) is equivalent to the class [^A\-Za\-z0\-9] +.sp +.ti -2.0i +\\( \\) ( ) parentheses are used to override operator precedence. +.sp +.ti -2.0i +\\\fIdigit\fP \\\fIdigit\fP \\\fIn\fP matches a repeat of the text +matched earlier in the regexp by the subexpression inside the nth +opening parenthesis. +.sp +.ti -2.0i +\\ \\ any special character may be preceded +by a backslash to match it literally. +.sp +.ti -2.0i +(the following are for compatibility with GNU Emacs) +.sp +.ti -2.0i +\\b \\b matches the empty string at the edge of a word. +.sp +.ti -2.0i +\\B \\B matches the empty string if not at the edge of a word. +.sp +.ti -2.0i +\\w \\w matches word-constituent characters (letters & digits). +.sp +.ti -2.0i +\\W \\W matches characters that are not word-constituent. +.RE +.PP +Operator precedence is (highest to lowest) ?, \(**, and +, concatenation, +and finally |. All other constructs are syntactically identical to +normal characters. For the truly interested, the file dfa.c describes +(and implements) the exact grammar understood by the parser. +.SH OPTIONS +.TP +.BI \-A " num" +print lines of context after every matching line +.TP +.BI \-B " num" +print +.I num +lines of context before every matching line +.TP +.B \-C +print 2 lines of context on each side of every match +.TP +.BI \- num +print +.I num +lines of context on each side of every match +.TP +.B \-V +print the version number on the diagnostic output +.TP +.B \-b +print every match preceded by its byte offset +.TP +.B \-c +print a total count of matching lines only +.TP +.BI \-e " expr" +search for +.IR expr ; +useful if +.I expr +begins with \- +.TP +.BI \-f " file" +search for the expression contained in +.I file +.TP +.B \-h +don't display filenames on matches +.TP +.B \-i +ignore case difference when comparing strings +.TP +.B \-l +list files containing matches only +.TP +.B \-n +print each match preceded by its line number +.TP +.B \-s +run silently producing no output except error messages +.TP +.B \-v +print only lines that contain no matches for the +.TP +.B \-w +print only lines where the match is a complete word +.TP +.B \-x +print only lines where the match is a whole line +.SH "SEE ALSO" +emacs(1), ed(1), sh(1), +.I "GNU Emacs Manual" +.SH INCOMPATIBILITIES +The following incompatibilities with UNIX +.I grep +exist: +.PP +.RS 0.5i +The context-dependent meaning of \(** is not quite the same (grep only). +.PP +.B \-b +prints a byte offset instead of a block offset. +.PP +The {\fIm,n\fP} construct of System V grep is not implemented. +.PP +.SH BUGS +GNU \fIe?grep\fP has been thoroughly debugged and tested over a period +of several years; we think it's a reliable beast or we wouldn't +distribute it. If by some fluke of the universe you discover a bug, +send a detailed description (including options, regular expressions, +and a copy of an input file that can reproduce it) to mike@ai.mit.edu. +.PP +.SH AUTHORS +Mike Haertel wrote the deterministic regexp code and the bulk +of the program. +.PP +James A. Woods is responsible for the hybridized search strategy +of using Boyer-Moore-Gosper fixed-string search as a filter +before calling the general regexp matcher. +.PP +Arthur David Olson contributed code that finds fixed strings for +the aforementioned BMG search for a large class of regexps. +.PP +Richard Stallman wrote the backtracking regexp matcher that is used +for \\\fIdigit\fP backreferences, as well as the GNU getopt. The +backtracking matcher was originally written for GNU Emacs. +.PP +D. A. Gwyn wrote the C alloca emulation that is provided so +System V machines can run this program. (Alloca is used only +by RMS' backtracking matcher, and then only rarely, so there +is no loss if your machine doesn't have a "real" alloca.) +.PP +Scott Anderson and Henry Spencer designed the regression tests +used in the "regress" script. +.PP +Paul Placeway wrote the original version of this manual page. diff --git a/gnu/usr.bin/grep/grep.c b/gnu/usr.bin/grep/grep.c new file mode 100644 index 000000000000..1c45c45fa425 --- /dev/null +++ b/gnu/usr.bin/grep/grep.c @@ -0,0 +1,1007 @@ +/* grep - print lines matching an extended regular expression + Copyright (C) 1988 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +/* Written June, 1988 by Mike Haertel + BMG speedups added July, 1988 by James A. Woods and Arthur David Olson */ + +#include + +#if defined(USG) || defined(STDC_HEADERS) +#include +#ifndef bcopy +#define bcopy(s,d,n) memcpy((d),(s),(n)) +#endif +#ifndef index +#define index strchr +#endif +#else +#include +#endif + +#ifdef HAVE_UNISTD_H +#include +#endif + +#ifndef STDC_HEADERS +extern char *getenv(); +#endif +extern int errno; + +extern char *sys_errlist[]; + +#include "dfa.h" +#include "regex.h" +#include "getopt.h" + +/* Used by -w */ +#define WCHAR(C) (ISALNUM(C) || (C) == '_') + +#define MAX(a, b) ((a) > (b) ? (a) : (b)) + +/* Exit status codes. */ +#define MATCHES_FOUND 0 /* Exit 0 if no errors and matches found. */ +#define NO_MATCHES_FOUND 1 /* Exit 1 if no matches were found. */ +#define ERROR 2 /* Exit 2 if some error occurred. */ + +/* Error is set true if something awful happened. */ +static int error; + +/* The program name for error messages. */ +static char *prog; + +/* We do all our own buffering by hand for efficiency. */ +static char *buffer; /* The buffer itself, grown as needed. */ +static bufbytes; /* Number of bytes in the buffer. */ +static size_t bufalloc; /* Number of bytes allocated to the buffer. */ +static bufprev; /* Number of bytes that have been forgotten. + This is used to get byte offsets from the + beginning of the file. */ +static bufread; /* Number of bytes to get with each read(). */ + +static void +initialize_buffer() +{ + bufread = 8192; + bufalloc = bufread + bufread / 2; + buffer = malloc(bufalloc); + if (! buffer) + { + fprintf(stderr, "%s: Memory exhausted (%s)\n", prog, + sys_errlist[errno]); + exit(ERROR); + } +} + +/* The current input file. */ +static fd; +static char *filename; +static eof; + +/* Fill the buffer retaining the last n bytes at the beginning of the + newly filled buffer (for backward context). Returns the number of new + bytes read from disk. */ +static int +fill_buffer_retaining(n) + int n; +{ + char *p, *q; + int i; + + /* See if we need to grow the buffer. */ + if (bufalloc - n <= bufread) + { + while (bufalloc - n <= bufread) + { + bufalloc *= 2; + bufread *= 2; + } + buffer = realloc(buffer, bufalloc); + if (! buffer) + { + fprintf(stderr, "%s: Memory exhausted (%s)\n", prog, + sys_errlist[errno]); + exit(ERROR); + } + } + + bufprev += bufbytes - n; + + /* Shift stuff down. */ + for (i = n, p = buffer, q = p + bufbytes - n; i--; ) + *p++ = *q++; + bufbytes = n; + + if (eof) + return 0; + + /* Read in new stuff. */ + i = read(fd, buffer + bufbytes, bufread); + if (i < 0) + { + fprintf(stderr, "%s: read on %s failed (%s)\n", prog, + filename ? filename : "", sys_errlist[errno]); + error = 1; + } + + /* Kludge to pretend every nonempty file ends with a newline. */ + if (i == 0 && bufbytes > 0 && buffer[bufbytes - 1] != '\n') + { + eof = i = 1; + buffer[bufbytes] = '\n'; + } + + bufbytes += i; + return i; +} + +/* Various flags set according to the argument switches. */ +static trailing_context; /* Lines of context to show after matches. */ +static leading_context; /* Lines of context to show before matches. */ +static byte_count; /* Precede output lines the byte count of the + first character on the line. */ +static no_filenames; /* Do not display filenames. */ +static line_numbers; /* Precede output lines with line numbers. */ +static silent; /* Produce no output at all. This switch + is bogus, ever hear of /dev/null? */ +static int whole_word; /* Match only whole words. Note that if + backreferences are used this depends on + the regex routines getting leftmost-longest + right, which they don't right now if | + is also used. */ +static int whole_line; /* Match only whole lines. Backreference + caveat applies here too. */ +static nonmatching_lines; /* Print lines that don't match the regexp. */ + +static bmgexec; /* Invoke Boyer-Moore-Gosper routines */ + +/* The compiled regular expression lives here. */ +static struct regexp reg; + +/* The compiled regular expression for the backtracking matcher lives here. */ +static struct re_pattern_buffer regex; + +/* Pointer in the buffer after the last character printed. */ +static char *printed_limit; + +/* True when printed_limit has been artifically advanced without printing + anything. */ +static int printed_limit_fake; + +/* Print a line at the given line number, returning the number of + characters actually printed. Matching is true if the line is to + be considered a "matching line". This is only meaningful if + surrounding context is turned on. */ +static int +print_line(p, number, matching) + char *p; + int number; + int matching; +{ + int count = 0; + + if (silent) + { + do + ++count; + while (*p++ != '\n'); + printed_limit_fake = 0; + printed_limit = p; + return count; + } + + if (filename && !no_filenames) + printf("%s%c", filename, matching ? ':' : '-'); + if (byte_count) + printf("%d%c", p - buffer + bufprev, matching ? ':' : '-'); + if (line_numbers) + printf("%d%c", number, matching ? ':' : '-'); + do + { + ++count; + putchar(*p); + } + while (*p++ != '\n'); + printed_limit_fake = 0; + printed_limit = p; + return count; +} + +/* Print matching or nonmatching lines from the current file. Returns a + count of matching or nonmatching lines. */ +static int +grep() +{ + int retain = 0; /* Number of bytes to retain on next call + to fill_buffer_retaining(). */ + char *search_limit; /* Pointer to the character after the last + newline in the buffer. */ + char saved_char; /* Character after the last newline. */ + char *resume; /* Pointer to where to resume search. */ + int resume_index = 0; /* Count of characters to ignore after + refilling the buffer. */ + int line_count = 1; /* Line number. */ + int try_backref; /* Set to true if we need to verify the + match with a backtracking matcher. */ + int initial_line_count; /* Line count at beginning of last search. */ + char *match; /* Pointer to the first character after the + string matching the regexp. */ + int match_count = 0; /* Count of matching lines. */ + char *matching_line; /* Pointer to first character of the matching + line, or of the first line of context to + print if context is turned on. */ + char *real_matching_line; /* Pointer to the first character of the + real matching line. */ + char *next_line; /* Pointer to first character of the line + following the matching line. */ + char *last_match_limit; /* Pointer after last matched line. */ + int pending_lines = 0; /* Lines of context left over from last match + that we have to print. */ + static first_match = 1; /* True when nothing has been printed. */ + int i; + char *tmp; + char *execute(); + + printed_limit_fake = 0; + + while (fill_buffer_retaining(retain) > 0) + { + /* Find the last newline in the buffer. */ + search_limit = buffer + bufbytes; + while (search_limit > buffer && search_limit[-1] != '\n') + --search_limit; + if (search_limit == buffer) + { + retain = bufbytes; + continue; + } + + /* Save the character after the last newline so regexecute can write + its own sentinel newline. */ + saved_char = *search_limit; + + /* Search the buffer for a match. */ + printed_limit = buffer; + resume = buffer + resume_index; + last_match_limit = resume; + initial_line_count = line_count; + + + /* In retrospect, I have to say that the following code sucks. + For an example of how to do this right, see the fgrep + driver program that I wrote around a year later. I'm + too lazy to retrofit that to egrep right now (the + pattern matchers have different needs). */ + + + while (match = execute(®, resume, search_limit, 0, &line_count, &try_backref)) + { + /* Find the beginning of the matching line. */ + matching_line = match; + while (matching_line > resume && matching_line[-1] != '\n') + --matching_line; + real_matching_line = matching_line; + + /* Find the beginning of the next line. */ + next_line = match; + while (next_line < search_limit && *next_line++ != '\n') + ; + + /* If a potential backreference is indicated, try it out with + a backtracking matcher to make sure the line is a match. + This is hairy because we need to handle whole_line and + whole_word matches specially. The method was stolen from + GNU fgrep. */ + if (try_backref) + { + struct re_registers regs; + int beg, len, maxlen, ret; + + beg = 0; + for (maxlen = next_line - matching_line - 1; beg <= maxlen; ++beg) + { + /* See if the matching line matches when backreferences + are considered... */ + ret = re_search (®ex, matching_line, maxlen, + beg, maxlen - beg, ®s); + if (ret == -1) + goto fail; + beg = ret; + len = regs.end[0] - beg; + /* Ok, now check if it subsumed the whole line if -x */ + if (whole_line && (beg != 0 || len != maxlen)) + goto fail; + /* If -w then check if the match aligns with word + boundaries. We have to do this iteratively, because + (a) The line may contain more than one occurence + of the pattern, and; + (b) Several alternatives in the pattern might + be valid at a given point, and we may need to + consider a shorter one in order to align with + word boundaries. */ + else if (whole_word) + while (len > 0) + { + /* If it's preceeded by a word constituent, then no go. */ + if (beg > 0 + && WCHAR((unsigned char) matching_line[beg - 1])) + break; + /* If it's followed by a word constituent, look for + a shorter match. */ + else if (beg + len < maxlen + && WCHAR((unsigned char) matching_line[beg + len])) + /* This is sheer incest. */ + len = re_match_2 (®ex, (unsigned char *) 0, 0, + matching_line, maxlen, + beg, ®s, beg + len - 1); + else + goto succeed; + } + else + goto succeed; + } + fail: + resume = next_line; + if (resume == search_limit) + break; + else + continue; + } + + succeed: + /* Print out the matching or nonmatching lines as necessary. */ + if (! nonmatching_lines) + { + /* Not -v, so nothing hairy... */ + ++match_count; + + /* Print leftover trailing context from last time around. */ + while (pending_lines && last_match_limit < matching_line) + { + last_match_limit += print_line(last_match_limit, + initial_line_count++, + 0); + --pending_lines; + } + + /* Back up over leading context if necessary. */ + for (i = leading_context; + i > 0 && matching_line > printed_limit; + --i) + { + while (matching_line > printed_limit + && (--matching_line)[-1] != '\n') + ; + --line_count; + } + + /* If context is enabled, we may have to print a separator. */ + if ((leading_context || trailing_context) && !silent + && !first_match && (printed_limit_fake + || matching_line > printed_limit)) + printf("----------\n"); + first_match = 0; + + /* Print the matching line and its leading context. */ + while (matching_line < real_matching_line) + matching_line += print_line(matching_line, line_count++, 0); + matching_line += print_line(matching_line, line_count++, 1); + + /* If there's trailing context, leave some lines pending until + next time. */ + pending_lines = trailing_context; + } + else if (matching_line == last_match_limit) + { + /* In the -v case, this is where we deal with leftover + trailing context from last time... */ + if (pending_lines > 0) + { + --pending_lines; + print_line(matching_line, line_count, 0); + } + ++line_count; + } + else if (matching_line > last_match_limit) + { + char *start = last_match_limit; + + /* Back up over leading context if necessary. */ + for (i = leading_context; start > printed_limit && i; --i) + { + while (start > printed_limit && (--start)[-1] != '\n') + ; + --initial_line_count; + } + + /* If context is enabled, we may have to print a separator. */ + if ((leading_context || trailing_context) && !silent + && !first_match && (printed_limit_fake + || start > printed_limit)) + printf("----------\n"); + first_match = 0; + + /* Print out the presumably matching leading context. */ + while (start < last_match_limit) + start += print_line(start, initial_line_count++, 0); + + /* Print out the nonmatching lines prior to the matching line. */ + while (start < matching_line) + { + /* This counts as a "matching line" in -v. */ + ++match_count; + start += print_line(start, initial_line_count++, 1); + } + + /* Deal with trailing context. In -v what this means is + we print the current (matching) line, marked as a non + matching line. */ + if (trailing_context) + { + print_line(matching_line, line_count, 0); + pending_lines = trailing_context - 1; + } + + /* Count the current line. */ + ++line_count; + } + else + /* Let us pray this never happens... */ + abort(); + + /* Resume searching at the beginning of the next line. */ + initial_line_count = line_count; + resume = next_line; + last_match_limit = next_line; + + if (resume == search_limit) + break; + } + + /* Restore the saved character. */ + *search_limit = saved_char; + + if (! nonmatching_lines) + { + while (last_match_limit < search_limit && pending_lines) + { + last_match_limit += print_line(last_match_limit, + initial_line_count++, + 0); + --pending_lines; + } + } + else if (search_limit > last_match_limit) + { + char *start = last_match_limit; + + /* Back up over leading context if necessary. */ + for (i = leading_context; start > printed_limit && i; --i) + { + while (start > printed_limit && (--start)[-1] != '\n') + ; + --initial_line_count; + } + + /* If context is enabled, we may have to print a separator. */ + if ((leading_context || trailing_context) && !silent + && !first_match && (printed_limit_fake + || start > printed_limit)) + printf("----------\n"); + first_match = 0; + + /* Print out all the nonmatching lines up to the search limit. */ + while (start < last_match_limit) + start += print_line(start, initial_line_count++, 0); + while (start < search_limit) + { + ++match_count; + start += print_line(start, initial_line_count++, 1); + } + + pending_lines = trailing_context; + resume_index = 0; + retain = bufbytes - (search_limit - buffer); + continue; + } + + /* Save the trailing end of the buffer for possible use as leading + context in the future. */ + i = leading_context; + tmp = search_limit; + while (tmp > printed_limit && i--) + while (tmp > printed_limit && (--tmp)[-1] != '\n') + ; + resume_index = search_limit - tmp; + retain = bufbytes - (tmp - buffer); + if (tmp > printed_limit) + printed_limit_fake = 1; + } + + return match_count; +} + +void +usage_and_die() +{ + fprintf(stderr, "\ +Usage: %s [-CVbchilnsvwx] [-num] [-A num] [-B num] [-f file]\n\ + [-e] expr [file...]\n", prog); + exit(ERROR); +} + +static char version[] = "GNU e?grep, version 1.6"; + +int +main(argc, argv) + int argc; + char **argv; +{ + int c; + int ignore_case = 0; /* Compile the regexp to ignore case. */ + char *the_regexp = 0; /* The regular expression. */ + int regexp_len; /* Length of the regular expression. */ + char *regexp_file = 0; /* File containing parallel regexps. */ + int count_lines = 0; /* Display only a count of matching lines. */ + int list_files = 0; /* Display only the names of matching files. */ + int line_count = 0; /* Count of matching lines for a file. */ + int matches_found = 0; /* True if matches were found. */ + char *regex_errmesg; /* Error message from regex routines. */ + char translate[_NOTCHAR]; /* Translate table for case conversion + (needed by the backtracking matcher). */ + + if (prog = index(argv[0], '/')) + ++prog; + else + prog = argv[0]; + + opterr = 0; + while ((c = getopt(argc, argv, "0123456789A:B:CVbce:f:hilnsvwx")) != EOF) + switch (c) + { + case '?': + usage_and_die(); + break; + + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + trailing_context = 10 * trailing_context + c - '0'; + leading_context = 10 * leading_context + c - '0'; + break; + + case 'A': + if (! sscanf(optarg, "%d", &trailing_context) + || trailing_context < 0) + usage_and_die(); + break; + + case 'B': + if (! sscanf(optarg, "%d", &leading_context) + || leading_context < 0) + usage_and_die(); + break; + + case 'C': + trailing_context = leading_context = 2; + break; + + case 'V': + fprintf(stderr, "%s\n", version); + break; + + case 'b': + byte_count = 1; + break; + + case 'c': + count_lines = 1; + silent = 1; + break; + + case 'e': + /* It doesn't make sense to mix -f and -e. */ + if (regexp_file) + usage_and_die(); + the_regexp = optarg; + break; + + case 'f': + /* It doesn't make sense to mix -f and -e. */ + if (the_regexp) + usage_and_die(); + regexp_file = optarg; + break; + + case 'h': + no_filenames = 1; + break; + + case 'i': + ignore_case = 1; + for (c = 0; c < _NOTCHAR; ++c) + if (isupper(c)) + translate[c] = tolower(c); + else + translate[c] = c; + regex.translate = translate; + break; + + case 'l': + list_files = 1; + silent = 1; + break; + + case 'n': + line_numbers = 1; + break; + + case 's': + silent = 1; + break; + + case 'v': + nonmatching_lines = 1; + break; + + case 'w': + whole_word = 1; + break; + + case 'x': + whole_line = 1; + break; + + default: + /* This can't happen. */ + fprintf(stderr, "%s: getopt(3) let one by!\n", prog); + usage_and_die(); + break; + } + + /* Set the syntax depending on whether we are EGREP or not. */ +#ifdef EGREP + regsyntax(RE_SYNTAX_EGREP, ignore_case); + re_set_syntax(RE_SYNTAX_EGREP); +#else + regsyntax(RE_SYNTAX_GREP, ignore_case); + re_set_syntax(RE_SYNTAX_GREP); +#endif + + /* Compile the regexp according to all the options. */ + if (regexp_file) + { + FILE *fp = fopen(regexp_file, "r"); + int len = 256; + int i = 0; + + if (! fp) + { + fprintf(stderr, "%s: %s: %s\n", prog, regexp_file, + sys_errlist[errno]); + exit(ERROR); + } + + the_regexp = malloc(len); + while ((c = getc(fp)) != EOF) + { + the_regexp[i++] = c; + if (i == len) + the_regexp = realloc(the_regexp, len *= 2); + } + fclose(fp); + /* Nuke the concluding newline so we won't match the empty string. */ + if (i > 0 && the_regexp[i - 1] == '\n') + --i; + regexp_len = i; + } + else if (! the_regexp) + { + if (optind >= argc) + usage_and_die(); + the_regexp = argv[optind++]; + regexp_len = strlen(the_regexp); + } + else + regexp_len = strlen(the_regexp); + + if (whole_word || whole_line) + { + /* In the whole-word case, we use the pattern: + (^|[^A-Za-z_])(userpattern)([^A-Za-z_]|$). + In the whole-line case, we use the pattern: + ^(userpattern)$. + BUG: Using [A-Za-z_] is locale-dependent! */ + + char *n = malloc(regexp_len + 50); + int i = 0; + +#ifdef EGREP + if (whole_word) + strcpy(n, "(^|[^A-Za-z_])("); + else + strcpy(n, "^("); +#else + /* Todo: Make *sure* this is the right syntax. Down with grep! */ + if (whole_word) + strcpy(n, "\\(^\\|[^A-Za-z_]\\)\\("); + else + strcpy(n, "^\\("); +#endif + i = strlen(n); + bcopy(the_regexp, n + i, regexp_len); + i += regexp_len; +#ifdef EGREP + if (whole_word) + strcpy(n + i, ")([^A-Za-z_]|$)"); + else + strcpy(n + i, ")$"); +#else + if (whole_word) + strcpy(n + i, "\\)\\([^A-Za-z_]\\|$\\)"); + else + strcpy(n + i, "\\)$"); +#endif + i += strlen(n + i); + regcompile(n, i, ®, 1); + } + else + regcompile(the_regexp, regexp_len, ®, 1); + + + if (regex_errmesg = re_compile_pattern(the_regexp, regexp_len, ®ex)) + regerror(regex_errmesg); + + /* + Find the longest metacharacter-free string which must occur in the + regexpr, before short-circuiting regexecute() with Boyer-Moore-Gosper. + (Conjecture: The problem in general is NP-complete.) If there is no + such string (like for many alternations), then default to full automaton + search. regmust() code and heuristics [see dfa.c] courtesy + Arthur David Olson. + */ + if (line_numbers == 0 && nonmatching_lines == 0) + { + if (reg.mustn == 0 || reg.mustn == MUST_MAX || + index(reg.must, '\0') != reg.must + reg.mustn) + bmgexec = 0; + else + { + reg.must[reg.mustn] = '\0'; + if (getenv("MUSTDEBUG") != NULL) + (void) printf("must have: \"%s\"\n", reg.must); + bmg_setup(reg.must, ignore_case); + bmgexec = 1; + } + } + + if (argc - optind < 2) + no_filenames = 1; + + initialize_buffer(); + + if (argc > optind) + while (optind < argc) + { + bufprev = eof = 0; + filename = argv[optind++]; + fd = open(filename, 0, 0); + if (fd < 0) + { + fprintf(stderr, "%s: %s: %s\n", prog, filename, + sys_errlist[errno]); + error = 1; + continue; + } + if (line_count = grep()) + matches_found = 1; + close(fd); + if (count_lines) + if (!no_filenames) + printf("%s:%d\n", filename, line_count); + else + printf("%d\n", line_count); + else if (list_files && line_count) + printf("%s\n", filename); + } + else + { + if (line_count = grep()) + matches_found = 1; + if (count_lines) + printf("%d\n", line_count); + else if (list_files && line_count) + printf("\n"); + } + + if (error) + exit(ERROR); + if (matches_found) + exit(MATCHES_FOUND); + exit(NO_MATCHES_FOUND); + return NO_MATCHES_FOUND; +} + +/* Needed by the regexp routines. This could be fancier, especially when + dealing with parallel regexps in files. */ +void +regerror(s) + const char *s; +{ + fprintf(stderr, "%s: %s\n", prog, s); + exit(ERROR); +} + +/* + bmg_setup() and bmg_search() adapted from: + Boyer/Moore/Gosper-assisted 'egrep' search, with delta0 table as in + original paper (CACM, October, 1977). No delta1 or delta2. According to + experiment (Horspool, Soft. Prac. Exp., 1982), delta2 is of minimal + practical value. However, to improve for worst case input, integrating + the improved Galil strategies (Apostolico/Giancarlo, Siam. J. Comput., + February 1986) deserves consideration. + + James A. Woods Copyleft (C) 1986, 1988 + NASA Ames Research Center +*/ + +char * +execute(r, begin, end, newline, count, try_backref) + struct regexp *r; + char *begin; + char *end; + int newline; + int *count; + int *try_backref; +{ + register char *p, *s; + char *match; + char *start = begin; + char save; /* regexecute() sentinel */ + int len; + char *bmg_search(); + + if (!bmgexec) /* full automaton search */ + return(regexecute(r, begin, end, newline, count, try_backref)); + else + { + len = end - begin; + while ((match = bmg_search((unsigned char *) start, len)) != NULL) + { + p = match; /* narrow search range to submatch line */ + while (p > begin && *p != '\n') + p--; + s = match; + while (s < end && *s != '\n') + s++; + s++; + + save = *s; + *s = '\0'; + match = regexecute(r, p, s, newline, count, try_backref); + *s = save; + + if (match != NULL) + return((char *) match); + else + { + start = s; + len = end - start; + } + } + return(NULL); + } +} + +int delta0[256]; +unsigned char cmap[256]; /* (un)folded characters */ +unsigned char pattern[5000]; +int patlen; + +char * +bmg_search(buffer, buflen) + unsigned char *buffer; + int buflen; +{ + register unsigned char *k, *strend, *s, *buflim; + register int t; + int j; + + if (patlen > buflen) + return NULL; + + buflim = buffer + buflen; + if (buflen > patlen * 4) + strend = buflim - patlen * 4; + else + strend = buffer; + + s = buffer; + k = buffer + patlen - 1; + + for (;;) + { + /* The dreaded inner loop, revisited. */ + while (k < strend && (t = delta0[*k])) + { + k += t; + k += delta0[*k]; + k += delta0[*k]; + } + while (k < buflim && delta0[*k]) + ++k; + if (k == buflim) + break; + + j = patlen - 1; + s = k; + while (--j >= 0 && cmap[*--s] == pattern[j]) + ; + /* + delta-less shortcut for literati, but + short shrift for genetic engineers. + */ + if (j >= 0) + k++; + else /* submatch */ + return ((char *)k); + } + return(NULL); +} + +int +bmg_setup(pat, folded) /* compute "boyer-moore" delta table */ + char *pat; + int folded; +{ /* ... HAKMEM lives ... */ + int j; + + patlen = strlen(pat); + + if (folded) /* fold case while saving pattern */ + for (j = 0; j < patlen; j++) + pattern[j] = (isupper((int) pat[j]) ? + (char) tolower((int) pat[j]) : pat[j]); + else + bcopy(pat, pattern, patlen); + + for (j = 0; j < 256; j++) + { + delta0[j] = patlen; + cmap[j] = (char) j; /* could be done at compile time */ + } + for (j = 0; j < patlen - 1; j++) + delta0[pattern[j]] = patlen - j - 1; + delta0[pattern[patlen - 1]] = 0; + + if (folded) + { + for (j = 0; j < patlen - 1; j++) + if (islower((int) pattern[j])) + delta0[toupper((int) pattern[j])] = patlen - j - 1; + if (islower((int) pattern[patlen - 1])) + delta0[toupper((int) pattern[patlen - 1])] = 0; + for (j = 'A'; j <= 'Z'; j++) + cmap[j] = (char) tolower((int) j); + } +} diff --git a/gnu/usr.bin/grep/regex.c b/gnu/usr.bin/grep/regex.c new file mode 100644 index 000000000000..d69e78438cad --- /dev/null +++ b/gnu/usr.bin/grep/regex.c @@ -0,0 +1,1768 @@ +/* Extended regular expression matching and search library. + Copyright (C) 1985, 1989 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + + In other words, you are welcome to use, share and improve this program. + You are forbidden to forbid anyone else to use, share and improve + what you give them. Help stamp out software-hoarding! */ + + +/* To test, compile with -Dtest. + This Dtestable feature turns this into a self-contained program + which reads a pattern, describes how it compiles, + then reads a string and searches for it. */ + +/* AIX requires this to be the first thing in the file. */ +#ifdef __GNUC__ +#undef alloca +#define alloca __builtin_alloca +#else /* not __GNUC__ */ +#if defined(sparc) && !defined(USG) && !defined(SVR4) && !defined(__svr4__) +#include +#else +#ifdef _AIX + #pragma alloca +#else +char *alloca (); +#endif +#endif /* sparc */ +#endif /* not __GNUC__ */ + + +#ifdef emacs + +/* The `emacs' switch turns on certain special matching commands + that make sense only in emacs. */ + +#include "config.h" +#include "lisp.h" +#include "buffer.h" +#include "syntax.h" + +#else /* not emacs */ + +#if defined(USG) || defined(STDC_HEADERS) +#include +#ifndef bcopy +#define bcopy(s,d,n) memcpy((d),(s),(n)) +#endif +#ifndef bcmp +#define bcmp(s1,s2,n) memcmp((s1),(s2),(n)) +#endif +#ifndef bzero +#define bzero(s,n) memset((s),0,(n)) +#endif +#endif + +#ifdef STDC_HEADERS +#include +#else +char *malloc (); +#endif + +/* + * Define the syntax stuff, so we can do the \<...\> things. + */ + +#ifndef Sword /* must be non-zero in some of the tests below... */ +#define Sword 1 +#endif + +#define SYNTAX(c) re_syntax_table[c] + +#ifdef SYNTAX_TABLE + +char *re_syntax_table; + +#else + +static char re_syntax_table[256]; + +static void +init_syntax_once () +{ + register int c; + static int done = 0; + + if (done) + return; + + bzero (re_syntax_table, sizeof re_syntax_table); + + for (c = 'a'; c <= 'z'; c++) + re_syntax_table[c] = Sword; + + for (c = 'A'; c <= 'Z'; c++) + re_syntax_table[c] = Sword; + + for (c = '0'; c <= '9'; c++) + re_syntax_table[c] = Sword; + + done = 1; +} + +#endif /* SYNTAX_TABLE */ +#endif /* not emacs */ + +#include "regex.h" + +/* Number of failure points to allocate space for initially, + when matching. If this number is exceeded, more space is allocated, + so it is not a hard limit. */ + +#ifndef NFAILURES +#define NFAILURES 80 +#endif /* NFAILURES */ + +/* width of a byte in bits */ + +#define BYTEWIDTH 8 + +#ifndef SIGN_EXTEND_CHAR +#ifdef __CHAR_UNSIGNED__ +#define SIGN_EXTEND_CHAR(c) ((c)>(char)127?(c)-256:(c)) +#else +#define SIGN_EXTEND_CHAR(x) (x) +#endif +#endif + +static int obscure_syntax = 0; + +/* Specify the precise syntax of regexp for compilation. + This provides for compatibility for various utilities + which historically have different, incompatible syntaxes. + + The argument SYNTAX is a bit-mask containing the two bits + RE_NO_BK_PARENS and RE_NO_BK_VBAR. */ + +int +re_set_syntax (syntax) + int syntax; +{ + int ret; + + ret = obscure_syntax; + obscure_syntax = syntax; + return ret; +} + +/* re_compile_pattern takes a regular-expression string + and converts it into a buffer full of byte commands for matching. + + PATTERN is the address of the pattern string + SIZE is the length of it. + BUFP is a struct re_pattern_buffer * which points to the info + on where to store the byte commands. + This structure contains a char * which points to the + actual space, which should have been obtained with malloc. + re_compile_pattern may use realloc to grow the buffer space. + + The number of bytes of commands can be found out by looking in + the struct re_pattern_buffer that bufp pointed to, + after re_compile_pattern returns. +*/ + +#define PATPUSH(ch) (*b++ = (char) (ch)) + +#define PATFETCH(c) \ + {if (p == pend) goto end_of_pattern; \ + c = * (unsigned char *) p++; \ + if (translate) c = translate[c]; } + +#define PATFETCH_RAW(c) \ + {if (p == pend) goto end_of_pattern; \ + c = * (unsigned char *) p++; } + +#define PATUNFETCH p-- + +#define EXTEND_BUFFER \ + { char *old_buffer = bufp->buffer; \ + if (bufp->allocated == (1<<16)) goto too_big; \ + bufp->allocated *= 2; \ + if (bufp->allocated > (1<<16)) bufp->allocated = (1<<16); \ + if (!(bufp->buffer = (char *) realloc (bufp->buffer, bufp->allocated))) \ + goto memory_exhausted; \ + c = bufp->buffer - old_buffer; \ + b += c; \ + if (fixup_jump) \ + fixup_jump += c; \ + if (laststart) \ + laststart += c; \ + begalt += c; \ + if (pending_exact) \ + pending_exact += c; \ + } + +static void store_jump (), insert_jump (); + +char * +re_compile_pattern (pattern, size, bufp) + char *pattern; + int size; + struct re_pattern_buffer *bufp; +{ + register char *b = bufp->buffer; + register char *p = pattern; + char *pend = pattern + size; + register unsigned c, c1; + char *p1; + unsigned char *translate = (unsigned char *) bufp->translate; + + /* address of the count-byte of the most recently inserted "exactn" command. + This makes it possible to tell whether a new exact-match character + can be added to that command or requires a new "exactn" command. */ + + char *pending_exact = 0; + + /* address of the place where a forward-jump should go + to the end of the containing expression. + Each alternative of an "or", except the last, ends with a forward-jump + of this sort. */ + + char *fixup_jump = 0; + + /* address of start of the most recently finished expression. + This tells postfix * where to find the start of its operand. */ + + char *laststart = 0; + + /* In processing a repeat, 1 means zero matches is allowed */ + + char zero_times_ok; + + /* In processing a repeat, 1 means many matches is allowed */ + + char many_times_ok; + + /* address of beginning of regexp, or inside of last \( */ + + char *begalt = b; + + /* Stack of information saved by \( and restored by \). + Four stack elements are pushed by each \(: + First, the value of b. + Second, the value of fixup_jump. + Third, the value of regnum. + Fourth, the value of begalt. */ + + int stackb[40]; + int *stackp = stackb; + int *stacke = stackb + 40; + int *stackt; + + /* Counts \('s as they are encountered. Remembered for the matching \), + where it becomes the "register number" to put in the stop_memory command */ + + int regnum = 1; + + bufp->fastmap_accurate = 0; + +#ifndef emacs +#ifndef SYNTAX_TABLE + /* + * Initialize the syntax table. + */ + init_syntax_once(); +#endif +#endif + + if (bufp->allocated == 0) + { + bufp->allocated = 28; + if (bufp->buffer) + /* EXTEND_BUFFER loses when bufp->allocated is 0 */ + bufp->buffer = (char *) realloc (bufp->buffer, 28); + else + /* Caller did not allocate a buffer. Do it for him */ + bufp->buffer = (char *) malloc (28); + if (!bufp->buffer) goto memory_exhausted; + begalt = b = bufp->buffer; + } + + while (p != pend) + { + if (b - bufp->buffer > bufp->allocated - 10) + /* Note that EXTEND_BUFFER clobbers c */ + EXTEND_BUFFER; + + PATFETCH (c); + + switch (c) + { + case '$': + if (obscure_syntax & RE_TIGHT_VBAR) + { + if (! (obscure_syntax & RE_CONTEXT_INDEP_OPS) && p != pend) + goto normal_char; + /* Make operand of last vbar end before this `$'. */ + if (fixup_jump) + store_jump (fixup_jump, jump, b); + fixup_jump = 0; + PATPUSH (endline); + break; + } + + /* $ means succeed if at end of line, but only in special contexts. + If randomly in the middle of a pattern, it is a normal character. */ + if (p == pend || *p == '\n' + || (obscure_syntax & RE_CONTEXT_INDEP_OPS) + || (obscure_syntax & RE_NO_BK_PARENS + ? *p == ')' + : *p == '\\' && p[1] == ')') + || (obscure_syntax & RE_NO_BK_VBAR + ? *p == '|' + : *p == '\\' && p[1] == '|')) + { + PATPUSH (endline); + break; + } + goto normal_char; + + case '^': + /* ^ means succeed if at beg of line, but only if no preceding pattern. */ + + if (laststart && p[-2] != '\n' + && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS)) + goto normal_char; + if (obscure_syntax & RE_TIGHT_VBAR) + { + if (p != pattern + 1 + && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS)) + goto normal_char; + PATPUSH (begline); + begalt = b; + } + else + PATPUSH (begline); + break; + + case '+': + case '?': + if (obscure_syntax & RE_BK_PLUS_QM) + goto normal_char; + handle_plus: + case '*': + /* If there is no previous pattern, char not special. */ + if (!laststart && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS)) + goto normal_char; + /* If there is a sequence of repetition chars, + collapse it down to equivalent to just one. */ + zero_times_ok = 0; + many_times_ok = 0; + while (1) + { + zero_times_ok |= c != '+'; + many_times_ok |= c != '?'; + if (p == pend) + break; + PATFETCH (c); + if (c == '*') + ; + else if (!(obscure_syntax & RE_BK_PLUS_QM) + && (c == '+' || c == '?')) + ; + else if ((obscure_syntax & RE_BK_PLUS_QM) + && c == '\\') + { + int c1; + PATFETCH (c1); + if (!(c1 == '+' || c1 == '?')) + { + PATUNFETCH; + PATUNFETCH; + break; + } + c = c1; + } + else + { + PATUNFETCH; + break; + } + } + + /* Star, etc. applied to an empty pattern is equivalent + to an empty pattern. */ + if (!laststart) + break; + + /* Now we know whether 0 matches is allowed, + and whether 2 or more matches is allowed. */ + if (many_times_ok) + { + /* If more than one repetition is allowed, + put in a backward jump at the end. */ + store_jump (b, maybe_finalize_jump, laststart - 3); + b += 3; + } + insert_jump (on_failure_jump, laststart, b + 3, b); + pending_exact = 0; + b += 3; + if (!zero_times_ok) + { + /* At least one repetition required: insert before the loop + a skip over the initial on-failure-jump instruction */ + insert_jump (dummy_failure_jump, laststart, laststart + 6, b); + b += 3; + } + break; + + case '.': + laststart = b; + PATPUSH (anychar); + break; + + case '[': + while (b - bufp->buffer + > bufp->allocated - 3 - (1 << BYTEWIDTH) / BYTEWIDTH) + /* Note that EXTEND_BUFFER clobbers c */ + EXTEND_BUFFER; + + laststart = b; + if (*p == '^') + PATPUSH (charset_not), p++; + else + PATPUSH (charset); + p1 = p; + + PATPUSH ((1 << BYTEWIDTH) / BYTEWIDTH); + /* Clear the whole map */ + bzero (b, (1 << BYTEWIDTH) / BYTEWIDTH); + /* Read in characters and ranges, setting map bits */ + while (1) + { + PATFETCH (c); + if (c == ']' && p != p1 + 1) break; + if (*p == '-' && p[1] != ']') + { + PATFETCH (c1); + PATFETCH (c1); + while (c <= c1) + b[c / BYTEWIDTH] |= 1 << (c % BYTEWIDTH), c++; + } + else + { + b[c / BYTEWIDTH] |= 1 << (c % BYTEWIDTH); + } + } + /* Discard any bitmap bytes that are all 0 at the end of the map. + Decrement the map-length byte too. */ + while ((int) b[-1] > 0 && b[b[-1] - 1] == 0) + b[-1]--; + b += b[-1]; + break; + + case '(': + if (! (obscure_syntax & RE_NO_BK_PARENS)) + goto normal_char; + else + goto handle_open; + + case ')': + if (! (obscure_syntax & RE_NO_BK_PARENS)) + goto normal_char; + else + goto handle_close; + + case '\n': + if (! (obscure_syntax & RE_NEWLINE_OR)) + goto normal_char; + else + goto handle_bar; + + case '|': + if (! (obscure_syntax & RE_NO_BK_VBAR)) + goto normal_char; + else + goto handle_bar; + + case '\\': + if (p == pend) goto invalid_pattern; + PATFETCH_RAW (c); + switch (c) + { + case '(': + if (obscure_syntax & RE_NO_BK_PARENS) + goto normal_backsl; + handle_open: + if (stackp == stacke) goto nesting_too_deep; + if (regnum < RE_NREGS) + { + PATPUSH (start_memory); + PATPUSH (regnum); + } + *stackp++ = b - bufp->buffer; + *stackp++ = fixup_jump ? fixup_jump - bufp->buffer + 1 : 0; + *stackp++ = regnum++; + *stackp++ = begalt - bufp->buffer; + fixup_jump = 0; + laststart = 0; + begalt = b; + break; + + case ')': + if (obscure_syntax & RE_NO_BK_PARENS) + goto normal_backsl; + handle_close: + if (stackp == stackb) goto unmatched_close; + begalt = *--stackp + bufp->buffer; + if (fixup_jump) + store_jump (fixup_jump, jump, b); + if (stackp[-1] < RE_NREGS) + { + PATPUSH (stop_memory); + PATPUSH (stackp[-1]); + } + stackp -= 2; + fixup_jump = 0; + if (*stackp) + fixup_jump = *stackp + bufp->buffer - 1; + laststart = *--stackp + bufp->buffer; + break; + + case '|': + if (obscure_syntax & RE_NO_BK_VBAR) + goto normal_backsl; + handle_bar: + insert_jump (on_failure_jump, begalt, b + 6, b); + pending_exact = 0; + b += 3; + if (fixup_jump) + store_jump (fixup_jump, jump, b); + fixup_jump = b; + b += 3; + laststart = 0; + begalt = b; + break; + +#ifdef emacs + case '=': + PATPUSH (at_dot); + break; + + case 's': + laststart = b; + PATPUSH (syntaxspec); + PATFETCH (c); + PATPUSH (syntax_spec_code[c]); + break; + + case 'S': + laststart = b; + PATPUSH (notsyntaxspec); + PATFETCH (c); + PATPUSH (syntax_spec_code[c]); + break; +#endif /* emacs */ + + case 'w': + laststart = b; + PATPUSH (wordchar); + break; + + case 'W': + laststart = b; + PATPUSH (notwordchar); + break; + + case '<': + PATPUSH (wordbeg); + break; + + case '>': + PATPUSH (wordend); + break; + + case 'b': + PATPUSH (wordbound); + break; + + case 'B': + PATPUSH (notwordbound); + break; + + case '`': + PATPUSH (begbuf); + break; + + case '\'': + PATPUSH (endbuf); + break; + + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + c1 = c - '0'; + if (c1 >= regnum) + goto normal_char; + for (stackt = stackp - 2; stackt > stackb; stackt -= 4) + if (*stackt == c1) + goto normal_char; + laststart = b; + PATPUSH (duplicate); + PATPUSH (c1); + break; + + case '+': + case '?': + if (obscure_syntax & RE_BK_PLUS_QM) + goto handle_plus; + + default: + normal_backsl: + /* You might think it would be useful for \ to mean + not to translate; but if we don't translate it + it will never match anything. */ + if (translate) c = translate[c]; + goto normal_char; + } + break; + + default: + normal_char: + if (!pending_exact || pending_exact + *pending_exact + 1 != b + || *pending_exact == 0177 || *p == '*' || *p == '^' + || ((obscure_syntax & RE_BK_PLUS_QM) + ? *p == '\\' && (p[1] == '+' || p[1] == '?') + : (*p == '+' || *p == '?'))) + { + laststart = b; + PATPUSH (exactn); + pending_exact = b; + PATPUSH (0); + } + PATPUSH (c); + (*pending_exact)++; + } + } + + if (fixup_jump) + store_jump (fixup_jump, jump, b); + + if (stackp != stackb) goto unmatched_open; + + bufp->used = b - bufp->buffer; + return 0; + + invalid_pattern: + return "Invalid regular expression"; + + unmatched_open: + return "Unmatched \\("; + + unmatched_close: + return "Unmatched \\)"; + + end_of_pattern: + return "Premature end of regular expression"; + + nesting_too_deep: + return "Nesting too deep"; + + too_big: + return "Regular expression too big"; + + memory_exhausted: + return "Memory exhausted"; +} + +/* Store where `from' points a jump operation to jump to where `to' points. + `opcode' is the opcode to store. */ + +static void +store_jump (from, opcode, to) + char *from, *to; + char opcode; +{ + from[0] = opcode; + from[1] = (to - (from + 3)) & 0377; + from[2] = (to - (from + 3)) >> 8; +} + +/* Open up space at char FROM, and insert there a jump to TO. + CURRENT_END gives te end of the storage no in use, + so we know how much data to copy up. + OP is the opcode of the jump to insert. + + If you call this function, you must zero out pending_exact. */ + +static void +insert_jump (op, from, to, current_end) + char op; + char *from, *to, *current_end; +{ + register char *pto = current_end + 3; + register char *pfrom = current_end; + while (pfrom != from) + *--pto = *--pfrom; + store_jump (from, op, to); +} + +/* Given a pattern, compute a fastmap from it. + The fastmap records which of the (1 << BYTEWIDTH) possible characters + can start a string that matches the pattern. + This fastmap is used by re_search to skip quickly over totally implausible text. + + The caller must supply the address of a (1 << BYTEWIDTH)-byte data area + as bufp->fastmap. + The other components of bufp describe the pattern to be used. */ + +void +re_compile_fastmap (bufp) + struct re_pattern_buffer *bufp; +{ + unsigned char *pattern = (unsigned char *) bufp->buffer; + int size = bufp->used; + register char *fastmap = bufp->fastmap; + register unsigned char *p = pattern; + register unsigned char *pend = pattern + size; + register int j, k; + unsigned char *translate = (unsigned char *) bufp->translate; + + unsigned char *stackb[NFAILURES]; + unsigned char **stackp = stackb; + + bzero (fastmap, (1 << BYTEWIDTH)); + bufp->fastmap_accurate = 1; + bufp->can_be_null = 0; + + while (p) + { + if (p == pend) + { + bufp->can_be_null = 1; + break; + } +#ifdef SWITCH_ENUM_BUG + switch ((int) ((enum regexpcode) *p++)) +#else + switch ((enum regexpcode) *p++) +#endif + { + case exactn: + if (translate) + fastmap[translate[p[1]]] = 1; + else + fastmap[p[1]] = 1; + break; + + case begline: + case before_dot: + case at_dot: + case after_dot: + case begbuf: + case endbuf: + case wordbound: + case notwordbound: + case wordbeg: + case wordend: + continue; + + case endline: + if (translate) + fastmap[translate['\n']] = 1; + else + fastmap['\n'] = 1; + if (bufp->can_be_null != 1) + bufp->can_be_null = 2; + break; + + case finalize_jump: + case maybe_finalize_jump: + case jump: + case dummy_failure_jump: + bufp->can_be_null = 1; + j = *p++ & 0377; + j += SIGN_EXTEND_CHAR (*(char *)p) << 8; + p += j + 1; /* The 1 compensates for missing ++ above */ + if (j > 0) + continue; + /* Jump backward reached implies we just went through + the body of a loop and matched nothing. + Opcode jumped to should be an on_failure_jump. + Just treat it like an ordinary jump. + For a * loop, it has pushed its failure point already; + if so, discard that as redundant. */ + if ((enum regexpcode) *p != on_failure_jump) + continue; + p++; + j = *p++ & 0377; + j += SIGN_EXTEND_CHAR (*(char *)p) << 8; + p += j + 1; /* The 1 compensates for missing ++ above */ + if (stackp != stackb && *stackp == p) + stackp--; + continue; + + case on_failure_jump: + j = *p++ & 0377; + j += SIGN_EXTEND_CHAR (*(char *)p) << 8; + p++; + *++stackp = p + j; + continue; + + case start_memory: + case stop_memory: + p++; + continue; + + case duplicate: + bufp->can_be_null = 1; + fastmap['\n'] = 1; + case anychar: + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (j != '\n') + fastmap[j] = 1; + if (bufp->can_be_null) + return; + /* Don't return; check the alternative paths + so we can set can_be_null if appropriate. */ + break; + + case wordchar: + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) == Sword) + fastmap[j] = 1; + break; + + case notwordchar: + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) != Sword) + fastmap[j] = 1; + break; + +#ifdef emacs + case syntaxspec: + k = *p++; + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) == (enum syntaxcode) k) + fastmap[j] = 1; + break; + + case notsyntaxspec: + k = *p++; + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) != (enum syntaxcode) k) + fastmap[j] = 1; + break; +#endif /* emacs */ + + case charset: + for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--) + if (p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) + { + if (translate) + fastmap[translate[j]] = 1; + else + fastmap[j] = 1; + } + break; + + case charset_not: + /* Chars beyond end of map must be allowed */ + for (j = *p * BYTEWIDTH; j < (1 << BYTEWIDTH); j++) + if (translate) + fastmap[translate[j]] = 1; + else + fastmap[j] = 1; + + for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--) + if (!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH)))) + { + if (translate) + fastmap[translate[j]] = 1; + else + fastmap[j] = 1; + } + break; + + default: + break; + } + + /* Get here means we have successfully found the possible starting characters + of one path of the pattern. We need not follow this path any farther. + Instead, look at the next alternative remembered in the stack. */ + if (stackp != stackb) + p = *stackp--; + else + break; + } +} + +/* Like re_search_2, below, but only one string is specified. */ + +int +re_search (pbufp, string, size, startpos, range, regs) + struct re_pattern_buffer *pbufp; + char *string; + int size, startpos, range; + struct re_registers *regs; +{ + return re_search_2 (pbufp, 0, 0, string, size, startpos, range, regs, size); +} + +/* Like re_match_2 but tries first a match starting at index STARTPOS, + then at STARTPOS + 1, and so on. + RANGE is the number of places to try before giving up. + If RANGE is negative, the starting positions tried are + STARTPOS, STARTPOS - 1, etc. + It is up to the caller to make sure that range is not so large + as to take the starting position outside of the input strings. + +The value returned is the position at which the match was found, + or -1 if no match was found, + or -2 if error (such as failure stack overflow). */ + +int +re_search_2 (pbufp, string1, size1, string2, size2, startpos, range, regs, mstop) + struct re_pattern_buffer *pbufp; + char *string1, *string2; + int size1, size2; + int startpos; + register int range; + struct re_registers *regs; + int mstop; +{ + register char *fastmap = pbufp->fastmap; + register unsigned char *translate = (unsigned char *) pbufp->translate; + int total = size1 + size2; + int val; + + /* Update the fastmap now if not correct already */ + if (fastmap && !pbufp->fastmap_accurate) + re_compile_fastmap (pbufp); + + /* Don't waste time in a long search for a pattern + that says it is anchored. */ + if (pbufp->used > 0 && (enum regexpcode) pbufp->buffer[0] == begbuf + && range > 0) + { + if (startpos > 0) + return -1; + else + range = 1; + } + + while (1) + { + /* If a fastmap is supplied, skip quickly over characters + that cannot possibly be the start of a match. + Note, however, that if the pattern can possibly match + the null string, we must test it at each starting point + so that we take the first null string we get. */ + + if (fastmap && startpos < total && pbufp->can_be_null != 1) + { + if (range > 0) + { + register int lim = 0; + register unsigned char *p; + int irange = range; + if (startpos < size1 && startpos + range >= size1) + lim = range - (size1 - startpos); + + p = ((unsigned char *) + &(startpos >= size1 ? string2 - size1 : string1)[startpos]); + + if (translate) + { + while (range > lim && !fastmap[translate[*p++]]) + range--; + } + else + { + while (range > lim && !fastmap[*p++]) + range--; + } + startpos += irange - range; + } + else + { + register unsigned char c; + if (startpos >= size1) + c = string2[startpos - size1]; + else + c = string1[startpos]; + c &= 0xff; + if (translate ? !fastmap[translate[c]] : !fastmap[c]) + goto advance; + } + } + + if (range >= 0 && startpos == total + && fastmap && pbufp->can_be_null == 0) + return -1; + + val = re_match_2 (pbufp, string1, size1, string2, size2, startpos, regs, mstop); + if (0 <= val) + { + if (val == -2) + return -2; + return startpos; + } + +#ifdef C_ALLOCA + alloca (0); +#endif /* C_ALLOCA */ + + advance: + if (!range) break; + if (range > 0) range--, startpos++; else range++, startpos--; + } + return -1; +} + +#ifndef emacs /* emacs never uses this */ +int +re_match (pbufp, string, size, pos, regs) + struct re_pattern_buffer *pbufp; + char *string; + int size, pos; + struct re_registers *regs; +{ + return re_match_2 (pbufp, 0, 0, string, size, pos, regs, size); +} +#endif /* emacs */ + +/* Maximum size of failure stack. Beyond this, overflow is an error. */ + +int re_max_failures = 2000; + +static int bcmp_translate(); +/* Match the pattern described by PBUFP + against data which is the virtual concatenation of STRING1 and STRING2. + SIZE1 and SIZE2 are the sizes of the two data strings. + Start the match at position POS. + Do not consider matching past the position MSTOP. + + If pbufp->fastmap is nonzero, then it had better be up to date. + + The reason that the data to match are specified as two components + which are to be regarded as concatenated + is so this function can be used directly on the contents of an Emacs buffer. + + -1 is returned if there is no match. -2 is returned if there is + an error (such as match stack overflow). Otherwise the value is the length + of the substring which was matched. */ + +int +re_match_2 (pbufp, string1, size1, string2, size2, pos, regs, mstop) + struct re_pattern_buffer *pbufp; + unsigned char *string1, *string2; + int size1, size2; + int pos; + struct re_registers *regs; + int mstop; +{ + register unsigned char *p = (unsigned char *) pbufp->buffer; + register unsigned char *pend = p + pbufp->used; + /* End of first string */ + unsigned char *end1; + /* End of second string */ + unsigned char *end2; + /* Pointer just past last char to consider matching */ + unsigned char *end_match_1, *end_match_2; + register unsigned char *d, *dend; + register int mcnt; + unsigned char *translate = (unsigned char *) pbufp->translate; + + /* Failure point stack. Each place that can handle a failure further down the line + pushes a failure point on this stack. It consists of two char *'s. + The first one pushed is where to resume scanning the pattern; + the second pushed is where to resume scanning the strings. + If the latter is zero, the failure point is a "dummy". + If a failure happens and the innermost failure point is dormant, + it discards that failure point and tries the next one. */ + + unsigned char *initial_stack[2 * NFAILURES]; + unsigned char **stackb = initial_stack; + unsigned char **stackp = stackb, **stacke = &stackb[2 * NFAILURES]; + + /* Information on the "contents" of registers. + These are pointers into the input strings; they record + just what was matched (on this attempt) by some part of the pattern. + The start_memory command stores the start of a register's contents + and the stop_memory command stores the end. + + At that point, regstart[regnum] points to the first character in the register, + regend[regnum] points to the first character beyond the end of the register, + regstart_seg1[regnum] is true iff regstart[regnum] points into string1, + and regend_seg1[regnum] is true iff regend[regnum] points into string1. */ + + unsigned char *regstart[RE_NREGS]; + unsigned char *regend[RE_NREGS]; + unsigned char regstart_seg1[RE_NREGS], regend_seg1[RE_NREGS]; + + /* Set up pointers to ends of strings. + Don't allow the second string to be empty unless both are empty. */ + if (!size2) + { + string2 = string1; + size2 = size1; + string1 = 0; + size1 = 0; + } + end1 = string1 + size1; + end2 = string2 + size2; + + /* Compute where to stop matching, within the two strings */ + if (mstop <= size1) + { + end_match_1 = string1 + mstop; + end_match_2 = string2; + } + else + { + end_match_1 = end1; + end_match_2 = string2 + mstop - size1; + } + + /* Initialize \) text positions to -1 + to mark ones that no \( or \) has been seen for. */ + + for (mcnt = 0; mcnt < sizeof (regend) / sizeof (*regend); mcnt++) + regend[mcnt] = (unsigned char *) -1; + + /* `p' scans through the pattern as `d' scans through the data. + `dend' is the end of the input string that `d' points within. + `d' is advanced into the following input string whenever necessary, + but this happens before fetching; + therefore, at the beginning of the loop, + `d' can be pointing at the end of a string, + but it cannot equal string2. */ + + if (pos <= size1) + d = string1 + pos, dend = end_match_1; + else + d = string2 + pos - size1, dend = end_match_2; + +/* Write PREFETCH; just before fetching a character with *d. */ +#define PREFETCH \ + while (d == dend) \ + { if (dend == end_match_2) goto fail; /* end of string2 => failure */ \ + d = string2; /* end of string1 => advance to string2. */ \ + dend = end_match_2; } + + /* This loop loops over pattern commands. + It exits by returning from the function if match is complete, + or it drops through if match fails at this starting point in the input data. */ + + while (1) + { + if (p == pend) + /* End of pattern means we have succeeded! */ + { + /* If caller wants register contents data back, convert it to indices */ + if (regs) + { + regs->start[0] = pos; + if (dend == end_match_1) + regs->end[0] = d - string1; + else + regs->end[0] = d - string2 + size1; + for (mcnt = 1; mcnt < RE_NREGS; mcnt++) + { + if (regend[mcnt] == (unsigned char *) -1) + { + regs->start[mcnt] = -1; + regs->end[mcnt] = -1; + continue; + } + if (regstart_seg1[mcnt]) + regs->start[mcnt] = regstart[mcnt] - string1; + else + regs->start[mcnt] = regstart[mcnt] - string2 + size1; + if (regend_seg1[mcnt]) + regs->end[mcnt] = regend[mcnt] - string1; + else + regs->end[mcnt] = regend[mcnt] - string2 + size1; + } + } + if (dend == end_match_1) + return (d - string1 - pos); + else + return d - string2 + size1 - pos; + } + + /* Otherwise match next pattern command */ +#ifdef SWITCH_ENUM_BUG + switch ((int) ((enum regexpcode) *p++)) +#else + switch ((enum regexpcode) *p++) +#endif + { + + /* \( is represented by a start_memory, \) by a stop_memory. + Both of those commands contain a "register number" argument. + The text matched within the \( and \) is recorded under that number. + Then, \ turns into a `duplicate' command which + is followed by the numeric value of as the register number. */ + + case start_memory: + regstart[*p] = d; + regstart_seg1[*p++] = (dend == end_match_1); + break; + + case stop_memory: + regend[*p] = d; + regend_seg1[*p++] = (dend == end_match_1); + break; + + case duplicate: + { + int regno = *p++; /* Get which register to match against */ + register unsigned char *d2, *dend2; + + d2 = regstart[regno]; + dend2 = ((regstart_seg1[regno] == regend_seg1[regno]) + ? regend[regno] : end_match_1); + while (1) + { + /* Advance to next segment in register contents, if necessary */ + while (d2 == dend2) + { + if (dend2 == end_match_2) break; + if (dend2 == regend[regno]) break; + d2 = string2, dend2 = regend[regno]; /* end of string1 => advance to string2. */ + } + /* At end of register contents => success */ + if (d2 == dend2) break; + + /* Advance to next segment in data being matched, if necessary */ + PREFETCH; + + /* mcnt gets # consecutive chars to compare */ + mcnt = dend - d; + if (mcnt > dend2 - d2) + mcnt = dend2 - d2; + /* Compare that many; failure if mismatch, else skip them. */ + if (translate ? bcmp_translate (d, d2, mcnt, translate) : bcmp (d, d2, mcnt)) + goto fail; + d += mcnt, d2 += mcnt; + } + } + break; + + case anychar: + /* fetch a data character */ + PREFETCH; + /* Match anything but a newline. */ + if ((translate ? translate[*d++] : *d++) == '\n') + goto fail; + break; + + case charset: + case charset_not: + { + /* Nonzero for charset_not */ + int not = 0; + register int c; + if (*(p - 1) == (unsigned char) charset_not) + not = 1; + + /* fetch a data character */ + PREFETCH; + + if (translate) + c = translate [*d]; + else + c = *d; + + if (c < *p * BYTEWIDTH + && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + not = !not; + + p += 1 + *p; + + if (!not) goto fail; + d++; + break; + } + + case begline: + if (d == string1 || d[-1] == '\n') + break; + goto fail; + + case endline: + if (d == end2 + || (d == end1 ? (size2 == 0 || *string2 == '\n') : *d == '\n')) + break; + goto fail; + + /* "or" constructs ("|") are handled by starting each alternative + with an on_failure_jump that points to the start of the next alternative. + Each alternative except the last ends with a jump to the joining point. + (Actually, each jump except for the last one really jumps + to the following jump, because tensioning the jumps is a hassle.) */ + + /* The start of a stupid repeat has an on_failure_jump that points + past the end of the repeat text. + This makes a failure point so that, on failure to match a repetition, + matching restarts past as many repetitions have been found + with no way to fail and look for another one. */ + + /* A smart repeat is similar but loops back to the on_failure_jump + so that each repetition makes another failure point. */ + + case on_failure_jump: + if (stackp == stacke) + { + unsigned char **stackx; + if (stacke - stackb > re_max_failures * 2) + return -2; + stackx = (unsigned char **) alloca (2 * (stacke - stackb) + * sizeof (char *)); + bcopy (stackb, stackx, (stacke - stackb) * sizeof (char *)); + stackp = stackx + (stackp - stackb); + stacke = stackx + 2 * (stacke - stackb); + stackb = stackx; + } + mcnt = *p++ & 0377; + mcnt += SIGN_EXTEND_CHAR (*(char *)p) << 8; + p++; + *stackp++ = mcnt + p; + *stackp++ = d; + break; + + /* The end of a smart repeat has an maybe_finalize_jump back. + Change it either to a finalize_jump or an ordinary jump. */ + + case maybe_finalize_jump: + mcnt = *p++ & 0377; + mcnt += SIGN_EXTEND_CHAR (*(char *)p) << 8; + p++; + { + register unsigned char *p2 = p; + /* Compare what follows with the begining of the repeat. + If we can establish that there is nothing that they would + both match, we can change to finalize_jump */ + while (p2 != pend + && (*p2 == (unsigned char) stop_memory + || *p2 == (unsigned char) start_memory)) + p2++; + if (p2 == pend) + p[-3] = (unsigned char) finalize_jump; + else if (*p2 == (unsigned char) exactn + || *p2 == (unsigned char) endline) + { + register int c = *p2 == (unsigned char) endline ? '\n' : p2[2]; + register unsigned char *p1 = p + mcnt; + /* p1[0] ... p1[2] are an on_failure_jump. + Examine what follows that */ + if (p1[3] == (unsigned char) exactn && p1[5] != c) + p[-3] = (unsigned char) finalize_jump; + else if (p1[3] == (unsigned char) charset + || p1[3] == (unsigned char) charset_not) + { + int not = p1[3] == (unsigned char) charset_not; + if (c < p1[4] * BYTEWIDTH + && p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + not = !not; + /* not is 1 if c would match */ + /* That means it is not safe to finalize */ + if (!not) + p[-3] = (unsigned char) finalize_jump; + } + } + } + p -= 2; + if (p[-1] != (unsigned char) finalize_jump) + { + p[-1] = (unsigned char) jump; + goto nofinalize; + } + + /* The end of a stupid repeat has a finalize-jump + back to the start, where another failure point will be made + which will point after all the repetitions found so far. */ + + case finalize_jump: + stackp -= 2; + + case jump: + nofinalize: + mcnt = *p++ & 0377; + mcnt += SIGN_EXTEND_CHAR (*(char *)p) << 8; + p += mcnt + 1; /* The 1 compensates for missing ++ above */ + break; + + case dummy_failure_jump: + if (stackp == stacke) + { + unsigned char **stackx + = (unsigned char **) alloca (2 * (stacke - stackb) + * sizeof (char *)); + bcopy (stackb, stackx, (stacke - stackb) * sizeof (char *)); + stackp = stackx + (stackp - stackb); + stacke = stackx + 2 * (stacke - stackb); + stackb = stackx; + } + *stackp++ = 0; + *stackp++ = 0; + goto nofinalize; + + case wordbound: + if (d == string1 /* Points to first char */ + || d == end2 /* Points to end */ + || (d == end1 && size2 == 0)) /* Points to end */ + break; + if ((SYNTAX (d[-1]) == Sword) + != (SYNTAX (d == end1 ? *string2 : *d) == Sword)) + break; + goto fail; + + case notwordbound: + if (d == string1 /* Points to first char */ + || d == end2 /* Points to end */ + || (d == end1 && size2 == 0)) /* Points to end */ + goto fail; + if ((SYNTAX (d[-1]) == Sword) + != (SYNTAX (d == end1 ? *string2 : *d) == Sword)) + goto fail; + break; + + case wordbeg: + if (d == end2 /* Points to end */ + || (d == end1 && size2 == 0) /* Points to end */ + || SYNTAX (* (d == end1 ? string2 : d)) != Sword) /* Next char not a letter */ + goto fail; + if (d == string1 /* Points to first char */ + || SYNTAX (d[-1]) != Sword) /* prev char not letter */ + break; + goto fail; + + case wordend: + if (d == string1 /* Points to first char */ + || SYNTAX (d[-1]) != Sword) /* prev char not letter */ + goto fail; + if (d == end2 /* Points to end */ + || (d == end1 && size2 == 0) /* Points to end */ + || SYNTAX (d == end1 ? *string2 : *d) != Sword) /* Next char not a letter */ + break; + goto fail; + +#ifdef emacs + case before_dot: + if (((d - string2 <= (unsigned) size2) + ? d - bf_p2 : d - bf_p1) + <= point) + goto fail; + break; + + case at_dot: + if (((d - string2 <= (unsigned) size2) + ? d - bf_p2 : d - bf_p1) + == point) + goto fail; + break; + + case after_dot: + if (((d - string2 <= (unsigned) size2) + ? d - bf_p2 : d - bf_p1) + >= point) + goto fail; + break; + + case wordchar: + mcnt = (int) Sword; + goto matchsyntax; + + case syntaxspec: + mcnt = *p++; + matchsyntax: + PREFETCH; + if (SYNTAX (*d++) != (enum syntaxcode) mcnt) goto fail; + break; + + case notwordchar: + mcnt = (int) Sword; + goto matchnotsyntax; + + case notsyntaxspec: + mcnt = *p++; + matchnotsyntax: + PREFETCH; + if (SYNTAX (*d++) == (enum syntaxcode) mcnt) goto fail; + break; +#else + case wordchar: + PREFETCH; + if (SYNTAX (*d++) == 0) goto fail; + break; + + case notwordchar: + PREFETCH; + if (SYNTAX (*d++) != 0) goto fail; + break; +#endif /* not emacs */ + + case begbuf: + if (d == string1) /* Note, d cannot equal string2 */ + break; /* unless string1 == string2. */ + goto fail; + + case endbuf: + if (d == end2 || (d == end1 && size2 == 0)) + break; + goto fail; + + case exactn: + /* Match the next few pattern characters exactly. + mcnt is how many characters to match. */ + mcnt = *p++; + if (translate) + { + do + { + PREFETCH; + if (translate[*d++] != *p++) goto fail; + } + while (--mcnt); + } + else + { + do + { + PREFETCH; + if (*d++ != *p++) goto fail; + } + while (--mcnt); + } + break; + + default: + break; + } + continue; /* Successfully matched one pattern command; keep matching */ + + /* Jump here if any matching operation fails. */ + fail: + if (stackp != stackb) + /* A restart point is known. Restart there and pop it. */ + { + if (!stackp[-2]) + { /* If innermost failure point is dormant, flush it and keep looking */ + stackp -= 2; + goto fail; + } + d = *--stackp; + p = *--stackp; + if (d >= string1 && d <= end1) + dend = end_match_1; + } + else break; /* Matching at this starting point really fails! */ + } + return -1; /* Failure to match */ +} + +static int +bcmp_translate (s1, s2, len, translate) + unsigned char *s1, *s2; + register int len; + unsigned char *translate; +{ + register unsigned char *p1 = s1, *p2 = s2; + while (len) + { + if (translate [*p1++] != translate [*p2++]) return 1; + len--; + } + return 0; +} + +/* Entry points compatible with bsd4.2 regex library */ + +#ifndef emacs + +static struct re_pattern_buffer re_comp_buf; + +char * +re_comp (s) + char *s; +{ + if (!s) + { + if (!re_comp_buf.buffer) + return "No previous regular expression"; + return 0; + } + + if (!re_comp_buf.buffer) + { + if (!(re_comp_buf.buffer = (char *) malloc (200))) + return "Memory exhausted"; + re_comp_buf.allocated = 200; + if (!(re_comp_buf.fastmap = (char *) malloc (1 << BYTEWIDTH))) + return "Memory exhausted"; + } + return re_compile_pattern (s, strlen (s), &re_comp_buf); +} + +int +re_exec (s) + char *s; +{ + int len = strlen (s); + return 0 <= re_search (&re_comp_buf, s, len, 0, len, 0); +} + +#endif /* emacs */ + +#ifdef test + +#include + +/* Indexed by a character, gives the upper case equivalent of the character */ + +static char upcase[0400] = + { 000, 001, 002, 003, 004, 005, 006, 007, + 010, 011, 012, 013, 014, 015, 016, 017, + 020, 021, 022, 023, 024, 025, 026, 027, + 030, 031, 032, 033, 034, 035, 036, 037, + 040, 041, 042, 043, 044, 045, 046, 047, + 050, 051, 052, 053, 054, 055, 056, 057, + 060, 061, 062, 063, 064, 065, 066, 067, + 070, 071, 072, 073, 074, 075, 076, 077, + 0100, 0101, 0102, 0103, 0104, 0105, 0106, 0107, + 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117, + 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127, + 0130, 0131, 0132, 0133, 0134, 0135, 0136, 0137, + 0140, 0101, 0102, 0103, 0104, 0105, 0106, 0107, + 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117, + 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127, + 0130, 0131, 0132, 0173, 0174, 0175, 0176, 0177, + 0200, 0201, 0202, 0203, 0204, 0205, 0206, 0207, + 0210, 0211, 0212, 0213, 0214, 0215, 0216, 0217, + 0220, 0221, 0222, 0223, 0224, 0225, 0226, 0227, + 0230, 0231, 0232, 0233, 0234, 0235, 0236, 0237, + 0240, 0241, 0242, 0243, 0244, 0245, 0246, 0247, + 0250, 0251, 0252, 0253, 0254, 0255, 0256, 0257, + 0260, 0261, 0262, 0263, 0264, 0265, 0266, 0267, + 0270, 0271, 0272, 0273, 0274, 0275, 0276, 0277, + 0300, 0301, 0302, 0303, 0304, 0305, 0306, 0307, + 0310, 0311, 0312, 0313, 0314, 0315, 0316, 0317, + 0320, 0321, 0322, 0323, 0324, 0325, 0326, 0327, + 0330, 0331, 0332, 0333, 0334, 0335, 0336, 0337, + 0340, 0341, 0342, 0343, 0344, 0345, 0346, 0347, + 0350, 0351, 0352, 0353, 0354, 0355, 0356, 0357, + 0360, 0361, 0362, 0363, 0364, 0365, 0366, 0367, + 0370, 0371, 0372, 0373, 0374, 0375, 0376, 0377 + }; + +main (argc, argv) + int argc; + char **argv; +{ + char pat[80]; + struct re_pattern_buffer buf; + int i; + char c; + char fastmap[(1 << BYTEWIDTH)]; + + /* Allow a command argument to specify the style of syntax. */ + if (argc > 1) + obscure_syntax = atoi (argv[1]); + + buf.allocated = 40; + buf.buffer = (char *) malloc (buf.allocated); + buf.fastmap = fastmap; + buf.translate = upcase; + + while (1) + { + gets (pat); + + if (*pat) + { + re_compile_pattern (pat, strlen(pat), &buf); + + for (i = 0; i < buf.used; i++) + printchar (buf.buffer[i]); + + putchar ('\n'); + + printf ("%d allocated, %d used.\n", buf.allocated, buf.used); + + re_compile_fastmap (&buf); + printf ("Allowed by fastmap: "); + for (i = 0; i < (1 << BYTEWIDTH); i++) + if (fastmap[i]) printchar (i); + putchar ('\n'); + } + + gets (pat); /* Now read the string to match against */ + + i = re_match (&buf, pat, strlen (pat), 0, 0); + printf ("Match value %d.\n", i); + } +} + +#ifdef NOTDEF +print_buf (bufp) + struct re_pattern_buffer *bufp; +{ + int i; + + printf ("buf is :\n----------------\n"); + for (i = 0; i < bufp->used; i++) + printchar (bufp->buffer[i]); + + printf ("\n%d allocated, %d used.\n", bufp->allocated, bufp->used); + + printf ("Allowed by fastmap: "); + for (i = 0; i < (1 << BYTEWIDTH); i++) + if (bufp->fastmap[i]) + printchar (i); + printf ("\nAllowed by translate: "); + if (bufp->translate) + for (i = 0; i < (1 << BYTEWIDTH); i++) + if (bufp->translate[i]) + printchar (i); + printf ("\nfastmap is%s accurate\n", bufp->fastmap_accurate ? "" : "n't"); + printf ("can %s be null\n----------", bufp->can_be_null ? "" : "not"); +} +#endif + +printchar (c) + char c; +{ + if (c < 041 || c >= 0177) + { + putchar ('\\'); + putchar (((c >> 6) & 3) + '0'); + putchar (((c >> 3) & 7) + '0'); + putchar ((c & 7) + '0'); + } + else + putchar (c); +} + +error (string) + char *string; +{ + puts (string); + exit (1); +} + +#endif /* test */ diff --git a/gnu/usr.bin/grep/regex.h b/gnu/usr.bin/grep/regex.h new file mode 100644 index 000000000000..a5e89d32ccb3 --- /dev/null +++ b/gnu/usr.bin/grep/regex.h @@ -0,0 +1,185 @@ +/* Definitions for data structures callers pass the regex library. + Copyright (C) 1985, 1989 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + + In other words, you are welcome to use, share and improve this program. + You are forbidden to forbid anyone else to use, share and improve + what you give them. Help stamp out software-hoarding! */ + + +/* Define number of parens for which we record the beginnings and ends. + This affects how much space the `struct re_registers' type takes up. */ +#ifndef RE_NREGS +#define RE_NREGS 10 +#endif + +/* These bits are used in the obscure_syntax variable to choose among + alternative regexp syntaxes. */ + +/* 1 means plain parentheses serve as grouping, and backslash + parentheses are needed for literal searching. + 0 means backslash-parentheses are grouping, and plain parentheses + are for literal searching. */ +#define RE_NO_BK_PARENS 1 + +/* 1 means plain | serves as the "or"-operator, and \| is a literal. + 0 means \| serves as the "or"-operator, and | is a literal. */ +#define RE_NO_BK_VBAR 2 + +/* 0 means plain + or ? serves as an operator, and \+, \? are literals. + 1 means \+, \? are operators and plain +, ? are literals. */ +#define RE_BK_PLUS_QM 4 + +/* 1 means | binds tighter than ^ or $. + 0 means the contrary. */ +#define RE_TIGHT_VBAR 8 + +/* 1 means treat \n as an _OR operator + 0 means treat it as a normal character */ +#define RE_NEWLINE_OR 16 + +/* 0 means that a special characters (such as *, ^, and $) always have + their special meaning regardless of the surrounding context. + 1 means that special characters may act as normal characters in some + contexts. Specifically, this applies to: + ^ - only special at the beginning, or after ( or | + $ - only special at the end, or before ) or | + *, +, ? - only special when not after the beginning, (, or | */ +#define RE_CONTEXT_INDEP_OPS 32 + +/* Now define combinations of bits for the standard possibilities. */ +#define RE_SYNTAX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS) +#define RE_SYNTAX_EGREP (RE_SYNTAX_AWK | RE_NEWLINE_OR) +#define RE_SYNTAX_GREP (RE_BK_PLUS_QM | RE_NEWLINE_OR) +#define RE_SYNTAX_EMACS 0 + +/* This data structure is used to represent a compiled pattern. */ + +struct re_pattern_buffer + { + char *buffer; /* Space holding the compiled pattern commands. */ + int allocated; /* Size of space that buffer points to */ + int used; /* Length of portion of buffer actually occupied */ + char *fastmap; /* Pointer to fastmap, if any, or zero if none. */ + /* re_search uses the fastmap, if there is one, + to skip quickly over totally implausible characters */ + char *translate; /* Translate table to apply to all characters before comparing. + Or zero for no translation. + The translation is applied to a pattern when it is compiled + and to data when it is matched. */ + char fastmap_accurate; + /* Set to zero when a new pattern is stored, + set to one when the fastmap is updated from it. */ + char can_be_null; /* Set to one by compiling fastmap + if this pattern might match the null string. + It does not necessarily match the null string + in that case, but if this is zero, it cannot. + 2 as value means can match null string + but at end of range or before a character + listed in the fastmap. */ + }; + +/* Structure to store "register" contents data in. + + Pass the address of such a structure as an argument to re_match, etc., + if you want this information back. + + start[i] and end[i] record the string matched by \( ... \) grouping i, + for i from 1 to RE_NREGS - 1. + start[0] and end[0] record the entire string matched. */ + +struct re_registers + { + int start[RE_NREGS]; + int end[RE_NREGS]; + }; + +/* These are the command codes that appear in compiled regular expressions, one per byte. + Some command codes are followed by argument bytes. + A command code can specify any interpretation whatever for its arguments. + Zero-bytes may appear in the compiled regular expression. */ + +enum regexpcode + { + unused, + exactn, /* followed by one byte giving n, and then by n literal bytes */ + begline, /* fails unless at beginning of line */ + endline, /* fails unless at end of line */ + jump, /* followed by two bytes giving relative address to jump to */ + on_failure_jump, /* followed by two bytes giving relative address of place + to resume at in case of failure. */ + finalize_jump, /* Throw away latest failure point and then jump to address. */ + maybe_finalize_jump, /* Like jump but finalize if safe to do so. + This is used to jump back to the beginning + of a repeat. If the command that follows + this jump is clearly incompatible with the + one at the beginning of the repeat, such that + we can be sure that there is no use backtracking + out of repetitions already completed, + then we finalize. */ + dummy_failure_jump, /* jump, and push a dummy failure point. + This failure point will be thrown away + if an attempt is made to use it for a failure. + A + construct makes this before the first repeat. */ + anychar, /* matches any one character */ + charset, /* matches any one char belonging to specified set. + First following byte is # bitmap bytes. + Then come bytes for a bit-map saying which chars are in. + Bits in each byte are ordered low-bit-first. + A character is in the set if its bit is 1. + A character too large to have a bit in the map + is automatically not in the set */ + charset_not, /* similar but match any character that is NOT one of those specified */ + start_memory, /* starts remembering the text that is matched + and stores it in a memory register. + followed by one byte containing the register number. + Register numbers must be in the range 0 through NREGS. */ + stop_memory, /* stops remembering the text that is matched + and stores it in a memory register. + followed by one byte containing the register number. + Register numbers must be in the range 0 through NREGS. */ + duplicate, /* match a duplicate of something remembered. + Followed by one byte containing the index of the memory register. */ + before_dot, /* Succeeds if before dot */ + at_dot, /* Succeeds if at dot */ + after_dot, /* Succeeds if after dot */ + begbuf, /* Succeeds if at beginning of buffer */ + endbuf, /* Succeeds if at end of buffer */ + wordchar, /* Matches any word-constituent character */ + notwordchar, /* Matches any char that is not a word-constituent */ + wordbeg, /* Succeeds if at word beginning */ + wordend, /* Succeeds if at word end */ + wordbound, /* Succeeds if at a word boundary */ + notwordbound, /* Succeeds if not at a word boundary */ + syntaxspec, /* Matches any character whose syntax is specified. + followed by a byte which contains a syntax code, Sword or such like */ + notsyntaxspec /* Matches any character whose syntax differs from the specified. */ + }; + +extern char *re_compile_pattern (); +/* Is this really advertised? */ +extern void re_compile_fastmap (); +extern int re_search (), re_search_2 (); +extern int re_match (), re_match_2 (); + +/* 4.2 bsd compatibility (yuck) */ +extern char *re_comp (); +extern int re_exec (); + +#ifdef SYNTAX_TABLE +extern char *re_syntax_table; +#endif diff --git a/gnu/usr.bin/grep/tests/khadafy.lines b/gnu/usr.bin/grep/tests/khadafy.lines new file mode 100644 index 000000000000..57e21a17948d --- /dev/null +++ b/gnu/usr.bin/grep/tests/khadafy.lines @@ -0,0 +1,32 @@ +1) Muammar Qaddafi +2) Mo'ammar Gadhafi +3) Muammar Kaddafi +4) Muammar Qadhafi +5) Moammar El Kadhafi +6) Muammar Gadafi +7) Mu'ammar al-Qadafi +8) Moamer El Kazzafi +9) Moamar al-Gaddafi +10) Mu'ammar Al Qathafi +11) Muammar Al Qathafi +12) Mo'ammar el-Gadhafi +13) Moamar El Kadhafi +14) Muammar al-Qadhafi +15) Mu'ammar al-Qadhdhafi +16) Mu'ammar Qadafi +17) Moamar Gaddafi +18) Mu'ammar Qadhdhafi +19) Muammar Khaddafi +20) Muammar al-Khaddafi +21) Mu'amar al-Kadafi +22) Muammar Ghaddafy +23) Muammar Ghadafi +24) Muammar Ghaddafi +25) Muamar Kaddafi +26) Muammar Quathafi +27) Muammar Gheddafi +28) Muamar Al-Kaddafi +29) Moammar Khadafy +30) Moammar Qudhafi +31) Mu'ammar al-Qaddafi +32) Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi diff --git a/gnu/usr.bin/grep/tests/khadafy.regexp b/gnu/usr.bin/grep/tests/khadafy.regexp new file mode 100644 index 000000000000..46fe8ddd6dad --- /dev/null +++ b/gnu/usr.bin/grep/tests/khadafy.regexp @@ -0,0 +1 @@ +M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] diff --git a/gnu/usr.bin/grep/tests/regress.sh b/gnu/usr.bin/grep/tests/regress.sh new file mode 100644 index 000000000000..b947036467e0 --- /dev/null +++ b/gnu/usr.bin/grep/tests/regress.sh @@ -0,0 +1,30 @@ +#! /bin/sh +# Regression test for GNU e?grep. +# Usage: regress.sh [dir-containing-egrep] + +builddir=${1-..} + +failures=0 + +# The Khadafy test is brought to you by Scott Anderson . . . +$builddir/egrep -f khadafy.regexp khadafy.lines > khadafy.out +if cmp khadafy.lines khadafy.out +then + rm khadafy.out +else + echo Khadafy test failed -- output left on khadafy.out + failures=1 +fi + +# . . . and the following by Henry Spencer. + +awk -F: -f scriptgen.awk spencer.tests > tmp.script + +if sh tmp.script $builddir +then + rm tmp.script + exit $failures +else + rm tmp.script + exit 1 +fi diff --git a/gnu/usr.bin/grep/tests/scriptgen.awk b/gnu/usr.bin/grep/tests/scriptgen.awk new file mode 100644 index 000000000000..cdadc917aee4 --- /dev/null +++ b/gnu/usr.bin/grep/tests/scriptgen.awk @@ -0,0 +1,10 @@ +BEGIN { print "failures=0"; } +!/^#/ && NF == 3 { + print "echo '" $3 "' | $1/egrep -e '" $2 "' > /dev/null 2>&1"; + print "if [ $? != " $1 " ]" + print "then" + printf "\techo Spencer test \\#%d failed\n", ++n + print "\tfailures=1" + print "fi" +} +END { print "exit $failures"; } diff --git a/gnu/usr.bin/grep/tests/spencer.tests b/gnu/usr.bin/grep/tests/spencer.tests new file mode 100644 index 000000000000..c7eee9bdde83 --- /dev/null +++ b/gnu/usr.bin/grep/tests/spencer.tests @@ -0,0 +1,122 @@ +0:abc:abc +1:abc:xbc +1:abc:axc +1:abc:abx +0:abc:xabcy +0:abc:ababc +0:ab*c:abc +0:ab*bc:abc +0:ab*bc:abbc +0:ab*bc:abbbbc +0:ab+bc:abbc +1:ab+bc:abc +1:ab+bc:abq +0:ab+bc:abbbbc +0:ab?bc:abbc +0:ab?bc:abc +1:ab?bc:abbbbc +0:ab?c:abc +0:^abc$:abc +1:^abc$:abcc +0:^abc:abcc +1:^abc$:aabc +0:abc$:aabc +0:^:abc +0:$:abc +0:a.c:abc +0:a.c:axc +0:a.*c:axyzc +1:a.*c:axyzd +1:a[bc]d:abc +0:a[bc]d:abd +1:a[b-d]e:abd +0:a[b-d]e:ace +0:a[b-d]:aac +0:a[-b]:a- +2:a[b-]:a- +1:a[b-a]:- +2:a[]b:- +2:a[:- +0:a]:a] +0:a[]]b:a]b +0:a[^bc]d:aed +1:a[^bc]d:abd +0:a[^-b]c:adc +1:a[^-b]c:a-c +1:a[^]b]c:a]c +0:a[^]b]c:adc +0:ab|cd:abc +0:ab|cd:abcd +0:()ef:def +0:()*:- +1:*a:- +0:^*:- +0:$*:- +1:(*)b:- +1:$b:b +2:a\:- +0:a\(b:a(b +0:a\(*b:ab +0:a\(*b:a((b +1:a\x:a\x +2:abc):- +2:(abc:- +0:((a)):abc +0:(a)b(c):abc +0:a+b+c:aabbabc +0:a**:- +0:a*?:- +0:(a*)*:- +0:(a*)+:- +0:(a|)*:- +0:(a*|b)*:- +0:(a+|b)*:ab +0:(a+|b)+:ab +0:(a+|b)?:ab +0:[^ab]*:cde +0:(^)*:- +0:(ab|)*:- +2:)(:- +1:abc: +1:abc: +0:a*: +0:([abc])*d:abbbcd +0:([abc])*bcd:abcd +0:a|b|c|d|e:e +0:(a|b|c|d|e)f:ef +0:((a*|b))*:- +0:abcd*efg:abcdefg +0:ab*:xabyabbbz +0:ab*:xayabbbz +0:(ab|cd)e:abcde +0:[abhgefdc]ij:hij +1:^(ab|cd)e:abcde +0:(abc|)ef:abcdef +0:(a|b)c*d:abcd +0:(ab|ab*)bc:abc +0:a([bc]*)c*:abc +0:a([bc]*)(c*d):abcd +0:a([bc]+)(c*d):abcd +0:a([bc]*)(c+d):abcd +0:a[bcd]*dcdcde:adcdcde +1:a[bcd]+dcdcde:adcdcde +0:(ab|a)b*c:abc +0:((a)(b)c)(d):abcd +0:[A-Za-z_][A-Za-z0-9_]*:alpha +0:^a(bc+|b[eh])g|.h$:abh +0:(bc+d$|ef*g.|h?i(j|k)):effgz +0:(bc+d$|ef*g.|h?i(j|k)):ij +1:(bc+d$|ef*g.|h?i(j|k)):effg +1:(bc+d$|ef*g.|h?i(j|k)):bcdd +0:(bc+d$|ef*g.|h?i(j|k)):reffgz +1:((((((((((a)))))))))):- +0:(((((((((a))))))))):a +1:multiple words of text:uh-uh +0:multiple words:multiple words, yeah +0:(.*)c(.*):abcde +1:\((.*),:(.*)\) +1:[k]:ab +0:abcd:abcd +0:a(bc)d:abcd +0:a[-]?c:ac +0:(....).*\1:beriberi