.\" This module is believed to contain source code proprietary to AT&T.
.\" Use and redistribution is subject to the Berkeley Software License
.\" Agreement and your Software Agreement with AT&T (Western Electric).
.\"
.\" @(#)ss3 8.1 (Berkeley) 6/8/93
.\"
.\" $FreeBSD$
.SH
3: Lexical Analysis
.PP
The user must supply a lexical analyzer to read the input stream and communicate tokens
(with values, if desired) to the parser.
The lexical analyzer is an integer-valued function called
.I yylex .
The function returns an integer, the
.I "token number" ,
representing the kind of token read.
If there is a value associated with that token, it should be assigned
to the external variable
.I yylval .
.PP
The parser and the lexical analyzer must agree on these token numbers in order for
communication between them to take place.
The numbers may be chosen by Yacc, or chosen by the user.
In either case, the ``#define'' mechanism of C is used to allow the lexical analyzer
to return these numbers symbolically.
For example, suppose that the token name DIGIT has been defined in the declarations section of the
Yacc specification file.
The relevant portion of the lexical analyzer might look like:
.DS
int yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
                . . .
        case \'0\':
        case \'1\':
          . . .
        case \'9\':
                yylval = c\-\'0\';   /* value of the digit */
                return( DIGIT );
                . . .
                }
        . . .
.DE
.PP
The intent is to return a token number of DIGIT, and a value equal to the numerical value of the
digit.
Provided that the lexical analyzer code is placed in the programs section of the specification file,
the identifier DIGIT will be defined as the token number associated
with the token DIGIT.
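.PP
As a minimal sketch of how the fragment above might be completed,
the following stand-alone analyzer returns DIGIT together with the value of the digit.
It rests on several assumptions that are not part of the original text:
DIGIT is defined by hand as 257 (in a real specification Yacc supplies the definition),
input comes from a string through the hypothetical helper
.I set_input
rather than from
.I getchar ,
and any character other than a digit is passed through as a literal.
.DS
```c
#define DIGIT 257   /* assumed: the number Yacc would give the first named token */

int yylval;                      /* value of the token most recently returned */
static const char *input = "";   /* hypothetical input buffer, stands in for getchar() */

/* Hypothetical helper: point the analyzer at a new string. */
void set_input(const char *s) { input = s; }

int yylex(void) {
    int c;

    if (*input == 0)
        return 0;                /* end of input */
    c = (unsigned char)*input++;
    switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        yylval = c - '0';        /* numerical value of the digit */
        return DIGIT;
    default:
        return c;                /* assumption: other characters stand for themselves */
    }
}
```
.DE
.PP
With the input ``7+'', successive calls would return DIGIT (with
.I yylval
set to 7), then the literal ``+'', then 0.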
.PP
This mechanism leads to clear,
easily modified lexical analyzers; the only pitfall is the need
to avoid using any token names in the grammar that are reserved
or significant in C or the parser.
For example, the use of
token names
.I if
or
.I while
will almost certainly cause severe
difficulties when the lexical analyzer is compiled.
The token name
.I error
is reserved for error handling, and should not be used naively
(see Section 7).
.PP
As mentioned above, the token numbers may be chosen by Yacc or by the user.
In the default situation, the numbers are chosen by Yacc.
The default token number for a literal
character is the numerical value of the character in the local character set.
Other names are assigned token numbers
starting at 257.
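.PP
The default numbering can be pictured with a short sketch.
The names NAME and NUMBER, and both helper functions, are hypothetical and serve only as illustration;
the sketch also assumes that the two names are declared in that order without explicit numbers,
so that they would be numbered consecutively from 257.
.DS
```c
/* Assumed definitions for two names declared without explicit numbers. */
#define NAME   257
#define NUMBER 258

/* Hypothetical helper: the default token number of a literal character
   is its value in the local character set. */
int literal_token(char c) {
    return (unsigned char)c;
}

/* Hypothetical helper: nonzero if tok lies in the range used for names. */
int is_named_token(int tok) {
    return tok >= 257;
}
```
.DE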
.PP
To assign a token number to a token (including literals),
the first appearance of the token name or literal
.I
in the declarations section
.R
can be immediately followed by
a nonnegative integer.
This integer is taken to be the token number of the name or literal.
Names and literals not defined by this mechanism retain their default definition.
It is important that all token numbers be distinct.
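.PP
When numbers are assigned by hand, a small check of this kind may help catch collisions.
The function below is a hypothetical illustration, not part of Yacc;
it simply compares every pair of numbers in a table supplied by the caller.
.DS
```c
#include <stddef.h>

/* Hypothetical checker: returns 1 if no two token numbers coincide, else 0.
   Quadratic, which should be acceptable for the small tables a grammar produces. */
int token_numbers_distinct(const int *nums, size_t n) {
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (nums[i] == nums[j])
                return 0;
    return 1;
}
```
.DE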
.PP
For historical reasons, the endmarker must have token
number 0 or negative.
This token number cannot be redefined by the user; thus, all
lexical analyzers should be prepared to return 0 or negative as a token number
upon reaching the end of their input.
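.PP
The endmarker convention can be sketched as follows.
This toy analyzer treats every character as a literal token and returns 0 once its input is exhausted;
the string input, the helper
.I scan_string ,
and the driver
.I count_tokens
are all hypothetical and stand in for a real input stream and a real parser.
.DS
```c
static const char *cursor = "";   /* hypothetical input, stands in for a real stream */

/* Hypothetical helper: start scanning a new string. */
void scan_string(const char *s) { cursor = s; }

/* Toy analyzer: every character is a literal token; 0 marks end of input. */
int yylex(void) {
    if (*cursor == 0)
        return 0;                 /* endmarker */
    return (unsigned char)*cursor++;
}

/* Count tokens up to, but not including, the endmarker. */
int count_tokens(void) {
    int n = 0;

    while (yylex() > 0)           /* 0 or negative means end of input */
        n++;
    return n;
}
```
.DE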
.PP
A very useful tool for constructing lexical analyzers is
the
.I Lex
program developed by Mike Lesk.
.[
Lesk Lex
.]
The lexical analyzers it produces are designed to work in close
harmony with Yacc parsers.
The specifications for these lexical analyzers
use regular expressions instead of grammar rules.
Lex can be easily used to produce quite complicated lexical analyzers,
but there remain some languages (such as FORTRAN) which do not
fit any theoretical framework, and whose lexical analyzers
must be crafted by hand.