.\"	$OpenBSD: flex.1,v 1.46 2024/11/09 18:06:00 op Exp $
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Vern Paxson.
.\"
.\" The United States Government has rights in this work pursuant
.\" to contract no. DE-AC03-76SF00098 between the United States
.\" Department of Energy and the University of California.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE.
.\"
.Dd $Mdocdate: November 9 2024 $
.Dt FLEX 1
.Os
.Sh NAME
.Nm flex ,
.Nm flex++ ,
.Nm lex
.Nd fast lexical analyzer generator
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl 78BbdFfhIiLlnpsTtVvw+?
.Op Fl C Ns Op Cm aeFfmr
.Op Fl Fl help
.Op Fl Fl version
.Op Fl o Ns Ar output
.Op Fl P Ns Ar prefix
.Op Fl S Ns Ar skeleton
.Op Ar
.Ek
.Sh DESCRIPTION
.Nm
is a tool for generating
.Em scanners :
programs which recognize lexical patterns in text.
.Nm
reads the given input files, or its standard input if no file names are given,
for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code,
called
.Em rules .
.Nm
generates as output a C source file,
.Pa lex.yy.c ,
which defines a routine
.Fn yylex .
This file is compiled and linked with the
.Fl lfl
library to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
.Pp
.Nm lex
is a synonym for
.Nm flex .
.Nm flex++
is a synonym for
.Nm
.Fl + .
.Pp
The manual includes both tutorial and reference sections:
.Bl -ohang
.It Sy Some Simple Examples
.It Sy Format of the Input File
.It Sy Patterns
The extended regular expressions used by
.Nm .
.It Sy How the Input is Matched
The rules for determining what has been matched.
.It Sy Actions
How to specify what to do when a pattern is matched.
.It Sy The Generated Scanner
Details regarding the scanner that
.Nm
produces;
how to control the input source.
.It Sy Start Conditions
Introducing context into scanners, and managing
.Qq mini-scanners .
.It Sy Multiple Input Buffers
How to manipulate multiple input sources;
how to scan from strings instead of files.
.It Sy End-of-File Rules
Special rules for matching the end of the input.
.It Sy Miscellaneous Macros
A summary of macros available to the actions.
.It Sy Values Available to the User
A summary of values available to the actions.
.It Sy Interfacing with Yacc
Connecting flex scanners together with
.Xr yacc 1
parsers.
.It Sy Options
.Nm
command-line options, and the
.Dq %option
directive.
.It Sy Performance Considerations
How to make scanners go as fast as possible.
.It Sy Generating C++ Scanners
The
.Pq experimental
facility for generating C++ scanner classes.
.It Sy Incompatibilities with Lex and POSIX
How
.Nm
differs from
.At
.Nm lex
and the
.Tn POSIX
.Nm lex
standard.
.It Sy Files
Files used by
.Nm .
.It Sy Diagnostics
Those error messages produced by
.Nm
.Pq or scanners it generates
whose meanings might not be apparent.
.It Sy See Also
Other documentation, related tools.
.It Sy Authors
Includes contact information.
.It Sy Bugs
Known problems with
.Nm .
.El
.Sh SOME SIMPLE EXAMPLES
The following simple examples give the flavor of how
.Nm
is used.
The first
.Nm
input specifies a scanner which, whenever it encounters the string
.Qq username ,
replaces it with the user's login name:
.Bd -literal -offset indent
%%
username    printf("%s", getlogin());
.Ed
.Pp
By default, any text not matched by a
.Nm
scanner is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence of
.Qq username
expanded.
In this input, there is just one rule.
.Qq username
is the
.Em pattern
and the
.Qq printf
is the
.Em action .
The
.Qq %%
marks the beginning of the rules.
.Pp
Here's another simple example:
.Bd -literal -offset indent
%{
int num_lines = 0, num_chars = 0;
%}

%%
\en      ++num_lines; ++num_chars;
\&.       ++num_chars;

%%
int
main(void)
{
	yylex();
	printf("# of lines = %d, # of chars = %d\en",
            num_lines, num_chars);
	return 0;
}
.Ed
.Pp
This scanner counts the number of characters and the number
of lines in its input
(it produces no output other than the final report on the counts).
The first line declares two globals,
.Qq num_lines
and
.Qq num_chars ,
which are accessible both inside
.Fn yylex
and in the
.Fn main
routine declared after the second
.Qq %% .
There are two rules, one which matches a newline
.Pq \&"\en\&"
and increments both the line count and the character count,
and one which matches any character other than a newline
(indicated by the
.Qq \&.
regular expression).
.Pp
A somewhat more complicated example:
.Bd -literal -offset indent
/* scanner for a toy Pascal-like language */

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
        printf("An integer: %s\en", yytext);
}

{DIGIT}+"."{DIGIT}* {
        printf("A float: %s\en", yytext);
}

if|then|begin|end|procedure|function {
        printf("A keyword: %s\en", yytext);
}

{ID}    printf("An identifier: %s\en", yytext);

"+"|"-"|"*"|"/"   printf("An operator: %s\en", yytext);

"{"[^}\en]*"}"     /* eat up one-line comments */

[ \et\en]+          /* eat up whitespace */

\&.       printf("Unrecognized character: %s\en", yytext);

%%

int
main(int argc, char *argv[])
{
        ++argv; --argc;  /* skip over program name */
        if (argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
}
.Ed
.Pp
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of
.Em tokens
and reports on what it has seen.
.Pp
The details of this example will be explained in the following sections.
.Sh FORMAT OF THE INPUT FILE
The
.Nm
input file consists of three sections, separated by a line with just
.Qq %%
in it:
.Bd -unfilled -offset indent
definitions
%%
rules
%%
user code
.Ed
.Pp
The
.Em definitions
section contains declarations of simple
.Em name
definitions to simplify the scanner specification, and declarations of
.Em start conditions ,
which are explained in a later section.
.Pp
Name definitions have the form:
.Pp
.D1 name definition
.Pp
The
.Qq name
is a word beginning with a letter or an underscore
.Pq Sq _
followed by zero or more letters, digits,
.Sq _ ,
or
.Sq -
.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.
The definition can subsequently be referred to using
.Qq {name} ,
which will expand to
.Qq (definition) .
For example:
.Bd -literal -offset indent
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.Ed
.Pp
This defines
.Qq DIGIT
to be a regular expression which matches a single digit, and
.Qq ID
to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.Pp
.Dl {DIGIT}+"."{DIGIT}*
.Pp
is identical to
.Pp
.Dl ([0-9])+"."([0-9])*
.Pp
and matches one-or-more digits followed by a
.Sq .\&
followed by zero-or-more digits.
.Pp
The
.Em rules
section of the
.Nm
input contains a series of rules of the form:
.Pp
.Dl pattern	action
.Pp
The pattern must be unindented and the action must begin
on the same line.
.Pp
See below for a further description of patterns and actions.
.Pp
Finally, the user code section is simply copied to
.Pa lex.yy.c
verbatim.
It is used for companion routines which call or are called by the scanner.
The presence of this section is optional;
if it is missing, the second
.Qq %%
in the input file may be skipped too.
.Pp
In the definitions and rules sections, any indented text or text enclosed in
.Sq %{
and
.Sq %}
is copied verbatim to the output
.Pq with the %{}'s removed .
The %{}'s must appear unindented on lines by themselves.
.Pp
In the rules section,
any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and
.Pq after the declarations
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.Tn POSIX
compliance; see below for other such features).
.Pp
In the definitions section
.Pq but not in the rules section ,
an unindented comment
(i.e., a line beginning with
.Qq /* )
is also copied verbatim to the output up to the next
.Qq */ .
.Sh PATTERNS
The patterns in the input are written using an extended set of regular
expressions.
These are:
.Bl -tag -width "XXXXXXXX"
.It x
Match the character
.Sq x .
.It .\&
Any character
.Pq byte
except newline.
.It [xyz]
A
.Qq character class ;
in this case, the pattern matches either an
.Sq x ,
a
.Sq y ,
or a
.Sq z .
.It [abj-oZ]
A
.Qq character class
with a range in it; matches an
.Sq a ,
a
.Sq b ,
any letter from
.Sq j
through
.Sq o ,
or a
.Sq Z .
.It [^A-Z]
A
.Qq negated character class ,
i.e., any character but those in the class.
In this case, any character EXCEPT an uppercase letter.
.It [^A-Z\en]
Any character EXCEPT an uppercase letter or a newline.
.It r*
Zero or more r's, where
.Sq r
is any regular expression.
.It r+
One or more r's.
.It r?
Zero or one r's (that is,
.Qq an optional r ) .
.It r{2,5}
Anywhere from two to five r's.
.It r{2,}
Two or more r's.
.It r{4}
Exactly 4 r's.
.It {name}
The expansion of the
.Qq name
definition
.Pq see above .
.It \&"[xyz]\e\&"foo\&"
The literal string: [xyz]"foo.
.It \eX
If
.Sq X
is an
.Sq a ,
.Sq b ,
.Sq f ,
.Sq n ,
.Sq r ,
.Sq t ,
or
.Sq v ,
then the ANSI-C interpretation of
.Sq \eX .
Otherwise, a literal
.Sq X
(used to escape operators such as
.Sq * ) .
.It \e0
A NUL character
.Pq ASCII code 0 .
.It \e123
The character with octal value 123.
.It \ex2a
The character with hexadecimal value 2a.
.It (r)
Match an
.Sq r ;
parentheses are used to override precedence
.Pq see below .
.It rs
The regular expression
.Sq r
followed by the regular expression
.Sq s ;
called
.Qq concatenation .
.It r|s
Either an
.Sq r
or an
.Sq s .
.It r/s
An
.Sq r ,
but only if it is followed by an
.Sq s .
The text matched by
.Sq s
is included when determining whether this rule is the
.Qq longest match ,
but is then returned to the input before the action is executed.
So the action only sees the text matched by
.Sq r .
This type of pattern is called
.Qq trailing context .
(There are some combinations of r/s that
.Nm
cannot match correctly; see notes in the
.Sx BUGS
section below regarding
.Qq dangerous trailing context . )
.It ^r
An
.Sq r ,
but only at the beginning of a line
(i.e., just starting to scan, or right after a newline has been scanned).
.It r$
An
.Sq r ,
but only at the end of a line
.Pq i.e., just before a newline .
Equivalent to
.Qq r/\en .
.Pp
Note that
.Nm flex Ns 's
notion of
.Qq newline
is exactly whatever the C compiler used to compile
.Nm
interprets
.Sq \en
as.
.\" In particular, on some DOS systems you must either filter out \er's in the
.\" input yourself, or explicitly use r/\er\en for
.\" .Qq r$ .
.It <s>r
An
.Sq r ,
but only in start condition
.Sq s
.Pq see below for discussion of start conditions .
.It <s1,s2,s3>r
The same, but in any of start conditions s1, s2, or s3.
.It <*>r
An
.Sq r
in any start condition, even an exclusive one.
.It <<EOF>>
An end-of-file.
.It <s1,s2><<EOF>>
An end-of-file when in start condition s1 or s2.
.El
.Pp
Note that inside of a character class, all regular expression operators
lose their special meaning except escape
.Pq Sq \e
and the character class operators,
.Sq - ,
.Sq ]\& ,
and, at the beginning of the class,
.Sq ^ .
.Pp
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
For example,
.Pp
.D1 foo|bar*
.Pp
is the same as
.Pp
.D1 (foo)|(ba(r*))
.Pp
since the
.Sq *
operator has higher precedence than concatenation,
and concatenation higher than alternation
.Pq Sq |\& .
This pattern therefore matches
.Em either
the string
.Qq foo
.Em or
the string
.Qq ba
followed by zero-or-more r's.
To match
.Qq foo
or zero-or-more "bar"'s,
use:
.Pp
.D1 foo|(bar)*
.Pp
and to match zero-or-more "foo"'s-or-"bar"'s:
.Pp
.D1 (foo|bar)*
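.Pp
The grouping of
.Qq foo|bar*
can be checked against a hand-written matcher.
The following C function is only a sketch of that one pattern, not
.Nm
output, and the name
.Fn matches_foo_or_bar_star
is made up for illustration:
.Bd -literal -offset indent
```c
#include <string.h>

/*
 * Hand-written model of the pattern foo|bar*, read as
 * (foo)|(ba(r*)): returns 1 if the whole string s matches.
 */
int
matches_foo_or_bar_star(const char *s)
{
        size_t i;

        if (strcmp(s, "foo") == 0)
                return 1;       /* first alternative: foo */
        if (strncmp(s, "ba", 2) != 0)
                return 0;       /* second alternative starts with ba */
        for (i = 2; s[i] == 'r'; i++)
                ;               /* the star applies to the final r only */
        return s[i] == 0;
}
```
.Ed
.Pp
Under this model
.Qq barrr
is accepted but
.Qq barbar
is not, because the star binds to the single character
.Sq r
rather than to the whole string
.Qq bar .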
.Pp
In addition to characters and ranges of characters, character classes
can also contain character class
.Em expressions .
These are expressions enclosed inside
.Sq [:
and
.Sq :]
delimiters (which themselves must appear between the
.Sq \&[
and
.Sq ]\&
of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.Bd -unfilled -offset indent
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
.Ed
.Pp
These expressions all designate a set of characters equivalent to
the corresponding standard C
.Fn isXXX
function.
For example, [:alnum:] designates those characters for which
.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric character.
Some systems don't provide
.Xr isblank 3 ,
so
.Nm
defines [:blank:] as a blank or a tab.
.Pp
For example, the following character classes are all equivalent:
.Bd -unfilled -offset indent
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
.Ed
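.Pp
The equivalence of the first and last forms can be illustrated in plain C.
This is a sketch that assumes an ASCII character set and the default
.Qq C
locale; the function names are hypothetical and are not part of
.Nm :
.Bd -literal -offset indent
```c
#include <ctype.h>

/* Model of [a-zA-Z0-9]: explicit ranges, assuming ASCII. */
int
in_explicit_ranges(int c)
{
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9');
}

/* Model of [[:alnum:]]: defer to isalnum(3). */
int
in_alnum_class(int c)
{
        return isalnum(c) != 0;
}

/* Returns 1 if both models agree on every 7-bit character. */
int
classes_agree(void)
{
        int c;

        for (c = 0; c < 128; c++)
                if (in_explicit_ranges(c) != in_alnum_class(c))
                        return 0;
        return 1;
}
```
.Ed
.Pp
In other locales
.Xr isalnum 3
may accept additional characters, so the comparison is only meant for the
.Qq C
locale.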
.Pp
If the scanner is case-insensitive (the
.Fl i
flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
.Pp
Some notes on patterns:
.Bl -dash
.It
A negated character class such as the example
.Qq [^A-Z]
above will match a newline unless "\en"
.Pq or an equivalent escape sequence
is one of the characters explicitly present in the negated character class
(e.g.,
.Qq [^A-Z\en] ) .
This is unlike how many other regular expression tools treat negated character
classes, but unfortunately the inconsistency is historically entrenched.
Matching newlines means that a pattern like
.Qq [^"]*
can match the entire input unless there's another quote in the input.
.It
A rule can have at most one instance of trailing context
(the
.Sq /
operator or the
.Sq $
operator).
The start condition,
.Sq ^ ,
and
.Qq <<EOF>>
patterns can only occur at the beginning of a pattern and, like
.Sq /
and
.Sq $ ,
cannot be grouped inside parentheses.
A
.Sq ^
which does not occur at the beginning of a rule or a
.Sq $
which does not occur at the end of a rule loses its special properties
and is treated as a normal character.
.It
The following are illegal:
.Bd -unfilled -offset indent
foo/bar$
<sc1>foo<sc2>bar
.Ed
.Pp
Note that the first of these can be written
.Qq foo/bar\en .
.It
The following will result in
.Sq $
or
.Sq ^
being treated as a normal character:
.Bd -unfilled -offset indent
foo|(bar$)
foo|^bar
.Ed
.Pp
If what's wanted is a
.Qq foo
or a bar-followed-by-a-newline, the following could be used
(the special
.Sq |\&
action is explained below):
.Bd -unfilled -offset indent
foo      |
bar$     /* action goes here */
.Ed
.Pp
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.El
.Sh HOW THE INPUT IS MATCHED
When the generated scanner is run,
it analyzes its input looking for strings which match any of its patterns.
If it finds more than one match,
it takes the one matching the most text
(for trailing context rules, this includes the length of the trailing part,
even though it will then be returned to the input).
If it finds two or more matches of the same length,
the rule listed first in the
.Nm
input file is chosen.
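.Pp
The selection rule just described can be modelled in a few lines of C.
This is only a sketch of the rule as stated, not how the generated scanner
is implemented; the structure
.Vt struct candidate
and the function
.Fn pick_match
are hypothetical names introduced for illustration:
.Bd -literal -offset indent
```c
#include <stddef.h>

/* One candidate match; illustrative only, not a flex data structure. */
struct candidate {
        int rule;       /* position of the rule in the input file */
        int head_len;   /* characters matched by the pattern proper */
        int trail_len;  /* characters of trailing context, else 0 */
};

/*
 * Choose the winner: the longest total match, counting trailing
 * context; on a tie, the rule listed first.
 * Returns NULL if there are no candidates at all.
 */
const struct candidate *
pick_match(const struct candidate *c, size_t n)
{
        const struct candidate *best = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
                int len = c[i].head_len + c[i].trail_len;

                if (best == NULL ||
                    len > best->head_len + best->trail_len ||
                    (len == best->head_len + best->trail_len &&
                    c[i].rule < best->rule))
                        best = &c[i];
        }
        return best;
}
```
.Ed
.Pp
In this model the winner's
.Va head_len
corresponds to the length of the token the action would see,
since the trailing part is returned to the input.
A
.Dv NULL
result corresponds to the case where no rule matches.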
.Pp
Once the match is determined, the text corresponding to the match
(called the
.Em token )
is made available in the global character pointer
.Fa yytext ,
and its length in the global integer
.Fa yyleng .
The
.Em action
corresponding to the matched pattern is then executed
.Pq a more detailed description of actions follows ,
and then the remaining input is scanned for another match.
.Pp
If no match is found, then the default rule is executed:
the next character in the input is considered matched and
copied to the standard output.
Thus, the simplest legal
.Nm
input is:
.Pp
.D1 %%
.Pp
which generates a scanner that simply copies its input
.Pq one character at a time
to its output.
.Pp
Note that
.Fa yytext
can be defined in two different ways:
either as a character pointer or as a character array.
Which definition
.Nm
uses can be controlled by including one of the special directives
.Dq %pointer
or
.Dq %array
in the first
.Pq definitions
section of flex input.
The default is
.Dq %pointer ,
unless the
.Fl l
.Nm lex
compatibility option is used, in which case
.Fa yytext
will be an array.
The advantage of using
.Dq %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens
.Pq unless not enough dynamic memory is available .
The disadvantage is that actions are restricted in how they can modify
.Fa yytext
.Pq see the next section ,
and calls to the
.Fn unput
function destroy the present contents of
.Fa yytext ,
which can be a considerable porting headache when moving between different
.Nm lex
versions.
.Pp
The advantage of
.Dq %array
is that
.Fa yytext
can be modified as much as wanted, and calls to
.Fn unput
do not destroy
.Fa yytext
.Pq see below .
Furthermore, existing
.Nm lex
programs sometimes access
.Fa yytext
externally using declarations of the form:
.Pp
.D1 extern char yytext[];
.Pp
This definition is erroneous when used with
.Dq %pointer ,
but correct for
.Dq %array .
.Pp
.Dq %array
defines
.Fa yytext
to be an array of
.Dv YYLMAX
characters, which defaults to a fairly large value.
The size can be changed by simply #define'ing
.Dv YYLMAX
to a different value in the first section of
.Nm
input.
As mentioned above, with
.Dq %pointer
.Fa yytext
grows dynamically to accommodate large tokens.
While this means a
.Dq %pointer
scanner can accommodate very large tokens
.Pq such as matching entire blocks of comments ,
bear in mind that each time the scanner must resize
.Fa yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.Fa yytext
presently does not dynamically grow if a call to
.Fn unput
results in too much text being pushed back; instead, a run-time error results.
.Pp
Also note that
.Dq %array
cannot be used with C++ scanner classes
.Pq the c++ option; see below .
.Sh ACTIONS
Each pattern in a rule has a corresponding action,
which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character;
the remainder of the line is its action.
If the action is empty,
then when the pattern is matched the input token is simply discarded.
For example, here is the specification for a program
which deletes all occurrences of
.Qq zap me
from its input:
.Bd -literal -offset indent
%%
"zap me"
.Ed
.Pp
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.Pp
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.Bd -literal -offset indent
%%
[ \et]+        putchar(' ');
[ \et]+$       /* ignore this token */
.Ed
.Pp
If the action contains a
.Sq { ,
then the action extends until the balancing
.Sq }
is found, and the action may cross multiple lines.
.Nm
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.Sq %{
and will consider the action to be all the text up to the next
.Sq %}
.Pq regardless of ordinary braces inside the action .
.Pp
An action consisting solely of a vertical bar
.Pq Sq |\&
means
.Qq same as the action for the next rule .
See below for an illustration.
.Pp
Actions can include arbitrary C code,
including return statements to return a value to whatever routine called
.Fn yylex .
Each time
.Fn yylex
is called, it continues processing tokens from where it last left off
until it either reaches the end of the file or executes a return.
.Pp
Actions are free to modify
.Fa yytext
except for lengthening it
(adding characters to its end \- these will overwrite later characters in the
input stream).
This, however, does not apply when using
.Dq %array
.Pq see above ;
in that case,
.Fa yytext
may be freely modified in any way.
.Pp
Actions are free to modify
.Fa yyleng
except they should not do so if the action also includes use of
.Fn yymore
.Pq see below .
.Pp
There are a number of special directives which can be included within
an action:
.Bl -tag -width Ds
.It ECHO
Copies
.Fa yytext
to the scanner's output.
.It BEGIN
Followed by the name of a start condition, places the scanner in the
corresponding start condition
.Pq see below .
.It REJECT
Directs the scanner to proceed on to the
.Qq second best
rule which matched the input
.Pq or a prefix of the input .
The rule is chosen as described above in
.Sx HOW THE INPUT IS MATCHED ,
and
.Fa yytext
and
.Fa yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.Nm
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine
.Fn special
whenever
.Qq frob
is seen:
.Bd -literal -offset indent
int word_count = 0;
%%

frob        special(); REJECT;
[^ \et\en]+   ++word_count;
.Ed
.Pp
Without the
.Em REJECT ,
any "frob"'s in the input would not be counted as words,
since the scanner normally executes only one action per token.
Multiple
.Em REJECT Ns 's
are allowed,
each one finding the next best choice to the currently active rule.
For example, when the following scanner scans the token
.Qq abcd ,
it will write
.Qq abcdabcaba
to the output:
.Bd -literal -offset indent
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
\&.|\en     /* eat up any unmatched character */
.Ed
.Pp
(The first three rules share the fourth's action since they use
the special
.Sq |\&
action.)
.Em REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in any of the scanner's actions it will slow down
all of the scanner's matching.
Furthermore,
.Em REJECT
cannot be used with the
.Fl Cf
or
.Fl CF
options
.Pq see below .
.Pp
Note also that unlike the other special actions,
.Em REJECT
is a
.Em branch ;
code immediately following it in the action will not be executed.
.It yymore()
Tells the scanner that the next time it matches a rule, the corresponding
token should be appended onto the current value of
.Fa yytext
rather than replacing it.
For example, given the input
.Qq mega-kludge
the following will write
.Qq mega-mega-kludge
to the output:
.Bd -literal -offset indent
%%
mega-    ECHO; yymore();
kludge   ECHO;
.Ed
.Pp
First
.Qq mega-
is matched and echoed to the output.
Then
.Qq kludge
is matched, but the previous
.Qq mega-
is still hanging around at the beginning of
.Fa yytext
so the
.Em ECHO
for the
.Qq kludge
rule will actually write
.Qq mega-kludge .
.Pp
Two notes regarding use of
.Fn yymore :
First,
.Fn yymore
depends on the value of
.Fa yyleng
correctly reflecting the size of the current token, so
.Fa yyleng
must not be modified when using
.Fn yymore .
Second, the presence of
.Fn yymore
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.It yyless(n)
Returns all but the first
.Ar n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.Fa yytext
and
.Fa yyleng
are adjusted appropriately (e.g.,
.Fa yyleng
will now be equal to
.Ar n ) .
For example, on the input
.Qq foobar
the following will write out
.Qq foobarbar :
.Bd -literal -offset indent
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
.Ed
.Pp
An argument of 0 to
.Fn yyless
will cause the entire current input string to be scanned again.
Unless the way the scanner subsequently processes its input has been changed
(using
.Em BEGIN ,
for example),
this will result in an endless loop.
.Pp
Note that
.Fn yyless
is a macro and can only be used in the
.Nm
input file, not from other source files.
.It unput(c)
Puts the character
.Ar c
back into the input stream.
It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.Bd -literal -offset indent
{
        int i;
        char *yycopy;

        /* Copy yytext because unput() trashes yytext */
        if ((yycopy = strdup(yytext)) == NULL)
                err(1, NULL);
        unput(')');
        for (i = yyleng - 1; i >= 0; --i)
                unput(yycopy[i]);
        unput('(');
        free(yycopy);
}
.Ed
.Pp
Note that since each
.Fn unput
puts the given character back at the beginning of the input stream,
pushing back strings must be done back-to-front.
.Pp
An important potential problem when using
.Fn unput
is that if using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
starting with its rightmost character and devouring one character to
the left with each call.
If the value of
.Fa yytext
should be preserved after a call to
.Fn unput
.Pq as in the above example ,
it must either first be copied elsewhere, or the scanner must be built using
.Dq %array
instead (see
.Sx HOW THE INPUT IS MATCHED ) .
.Pp
Finally, note that EOF cannot be pushed back
in an attempt to mark the input stream with an end-of-file.
.It input()
Reads the next character from the input stream.
For example, the following is one way to eat up C comments:
.Bd -literal -offset indent
%%
"/*" {
        int c;

        for (;;) {
                while ((c = input()) != '*' && c != EOF)
                        ; /* eat up text of comment */

                if (c == '*') {
                        while ((c = input()) == '*')
                                ;
                        if (c == '/')
                                break; /* found the end */
                }

                if (c == EOF) {
                        errx(1, "EOF in comment");
                        break;
                }
        }
}
.Ed
.Pp
(Note that if the scanner is compiled using C++, then
.Fn input
is instead referred to as
.Fn yyinput ,
in order to avoid a name clash with the C++ stream by the name of input.)
.It YY_FLUSH_BUFFER
Flushes the scanner's internal buffer
so that the next time the scanner attempts to match a token,
it will first refill the buffer using
.Dv YY_INPUT
(see
.Sx THE GENERATED SCANNER ,
below).
This action is a special case of the more general
.Fn yy_flush_buffer
function, described below in the section
.Sx MULTIPLE INPUT BUFFERS .
.It yyterminate()
Can be used in lieu of a return statement in an action.
It terminates the scanner and returns a 0 to the scanner's caller, indicating
.Qq all done .
By default,
.Fn yyterminate
is also called when an end-of-file is encountered.
It is a macro and may be redefined.
.El
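.Pp
The reason strings must be pushed back with
.Fn unput
back-to-front can be seen in a small stand-alone model.
The following C code is a sketch that treats the input stream as a stack;
it does not use the scanner's real buffer, and every name in it
.Po
.Vt struct pushback ,
.Fn model_unput ,
.Fn model_input ,
.Fn rescan_parenthesized
.Pc
is hypothetical:
.Bd -literal -offset indent
```c
#include <assert.h>
#include <string.h>

#define PUSHBACK_MAX 64         /* arbitrary capacity for this sketch */

/* A pushback stack standing in for the scanner's input stream. */
struct pushback {
        char buf[PUSHBACK_MAX];
        size_t len;
};

/* Like unput(c): c becomes the next character to be read. */
void
model_unput(struct pushback *p, char c)
{
        assert(p->len < PUSHBACK_MAX);
        p->buf[p->len++] = c;
}

/* Like input(): the most recently pushed character, or -1. */
int
model_input(struct pushback *p)
{
        return p->len > 0 ? (unsigned char)p->buf[--p->len] : -1;
}

/*
 * Push tok back enclosed in parentheses, back-to-front, then read
 * everything into out (which must hold strlen(tok) + 3 bytes).
 */
void
rescan_parenthesized(const char *tok, char *out)
{
        struct pushback p = { {0}, 0 };
        size_t i;
        int c;

        model_unput(&p, ')');
        for (i = strlen(tok); i > 0; i--)
                model_unput(&p, tok[i - 1]);
        model_unput(&p, '(');
        while ((c = model_input(&p)) != -1)
                *out++ = (char)c;
        *out = 0;
}
```
.Ed
.Pp
Because the last character pushed is the first one read,
pushing the closing parenthesis first and the opening parenthesis last
yields the token in its original order between the two.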
1188.Sh THE GENERATED SCANNER
1189The output of
1190.Nm
1191is the file
1192.Pa lex.yy.c ,
1193which contains the scanning routine
1194.Fn yylex ,
1195a number of tables used by it for matching tokens,
1196and a number of auxiliary routines and macros.
1197By default,
1198.Fn yylex
1199is declared as follows:
1200.Bd -unfilled -offset indent
1201int yylex()
1202{
1203    ... various definitions and the actions in here ...
1204}
1205.Ed
1206.Pp
1207(If the environment supports function prototypes, then it will
1208be "int yylex(void)".)
1209This definition may be changed by defining the
1210.Dv YY_DECL
1211macro.
1212For example:
1213.Bd -literal -offset indent
1214#define YY_DECL float lexscan(a, b) float a, b;
1215.Ed
1216.Pp
1217would give the scanning routine the name
1218.Em lexscan ,
1219returning a float, and taking two floats as arguments.
1220Note that if arguments are given to the scanning routine using a
1221K&R-style/non-prototyped function declaration,
the definition must be terminated with a semicolon
1223.Pq Sq ;\& .
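.Pp
With a prototyped declaration no trailing semicolon is needed.
The following is a minimal sketch rather than output from
.Nm :
it assumes only that the generated file places
.Dv YY_DECL
directly in front of the body of the scanning routine,
and the body shown is a stand-in for illustration.
.Bd -literal -offset indent
```c
/* Prototyped form of the example above: no trailing semicolon. */
#define YY_DECL float lexscan(float a, float b)

/*
 * The generated file expands YY_DECL in front of the scanner body.
 * This body is a stand-in; it is not what a real scanner contains.
 */
YY_DECL
{
        return a + b;
}
```
.Ed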
1224.Pp
1225Whenever
1226.Fn yylex
1227is called, it scans tokens from the global input file
1228.Pa yyin
1229.Pq which defaults to stdin .
1230It continues until it either reaches an end-of-file
1231.Pq at which point it returns the value 0
1232or one of its actions executes a
1233.Em return
1234statement.
1235.Pp
1236If the scanner reaches an end-of-file, subsequent calls are undefined
1237unless either
1238.Em yyin
1239is pointed at a new input file
1240.Pq in which case scanning continues from that file ,
1241or
1242.Fn yyrestart
1243is called.
1244.Fn yyrestart
1245takes one argument, a
1246.Fa FILE *
1247pointer (which can be nil, if
1248.Dv YY_INPUT
1249has been set up to scan from a source other than
1250.Em yyin ) ,
1251and initializes
1252.Em yyin
1253for scanning from that file.
1254Essentially there is no difference between just assigning
1255.Em yyin
1256to a new input file or using
1257.Fn yyrestart
1258to do so; the latter is available for compatibility with previous versions of
1259.Nm ,
1260and because it can be used to switch input files in the middle of scanning.
1261It can also be used to throw away the current input buffer,
1262by calling it with an argument of
1263.Em yyin ;
but it is better to use
1265.Dv YY_FLUSH_BUFFER
1266.Pq see above .
1267Note that
1268.Fn yyrestart
1269does not reset the start condition to
1270.Em INITIAL
1271(see
1272.Sx START CONDITIONS ,
1273below).
1274.Pp
1275If
1276.Fn yylex
1277stops scanning due to executing a
1278.Em return
1279statement in one of the actions, the scanner may then be called again and it
1280will resume scanning where it left off.
1281.Pp
1282By default
1283.Pq and for purposes of efficiency ,
1284the scanner uses block-reads rather than simple
1285.Xr getc 3
1286calls to read characters from
1287.Em yyin .
1288The nature of how it gets its input can be controlled by defining the
1289.Dv YY_INPUT
1290macro.
1291.Dv YY_INPUT Ns 's
1292calling sequence is
1293.Qq YY_INPUT(buf,result,max_size) .
1294Its action is to place up to
1295.Dv max_size
1296characters in the character array
1297.Em buf
1298and return in the integer variable
1299.Em result
1300either the number of characters read or the constant
1301.Dv YY_NULL
1302(0 on
1303.Ux
1304systems)
1305to indicate
1306.Dv EOF .
1307The default
1308.Dv YY_INPUT
1309reads from the global file-pointer
1310.Qq yyin .
1311.Pp
1312A sample definition of
1313.Dv YY_INPUT
1314.Pq in the definitions section of the input file :
1315.Bd -unfilled -offset indent
1316%{
1317#define YY_INPUT(buf,result,max_size) \e
1318{ \e
1319        int c = getchar(); \e
1320        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1321}
1322%}
1323.Ed
1324.Pp
1325This definition will change the input processing to occur
1326one character at a time.
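.Pp
As a further illustration,
.Dv YY_INPUT
can draw from memory instead of a file.
The sketch below is an example built on assumptions, not part of
.Nm :
.Va src_text ,
.Va src_off
and
.Fn src_read
are hypothetical names, and
.Dv YY_NULL
is defined only so that the fragment compiles on its own.
For most uses
.Fn yy_scan_string
and its relatives, mentioned below, are the simpler route.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0       /* stand-in: normally defined by lex.yy.c */
#endif

/* Hypothetical input source: a string and a read offset into it. */
static const char *src_text = "";
static size_t src_off;

/* Copy up to max_size unread bytes into buf; return count or YY_NULL. */
static int
src_read(char *buf, size_t max_size)
{
        size_t left = strlen(src_text) - src_off;
        size_t n = left < max_size ? left : max_size;

        if (n == 0)
                return YY_NULL;         /* end of input */
        memcpy(buf, src_text + src_off, n);
        src_off += n;
        return (int)n;
}

#define YY_INPUT(buf, result, max_size) \
        ((result) = src_read((buf), (max_size)))
```
.Ed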
1327.Pp
1328When the scanner receives an end-of-file indication from
1329.Dv YY_INPUT ,
1330it then checks the
1331.Fn yywrap
1332function.
1333If
1334.Fn yywrap
1335returns false
1336.Pq zero ,
1337then it is assumed that the function has gone ahead and set up
1338.Em yyin
1339to point to another input file, and scanning continues.
1340If it returns true
1341.Pq non-zero ,
1342then the scanner terminates, returning 0 to its caller.
1343Note that in either case, the start condition remains unchanged;
1344it does not revert to
1345.Em INITIAL .
1346.Pp
1347If you do not supply your own version of
1348.Fn yywrap ,
1349then you must either use
1350.Dq %option noyywrap
1351(in which case the scanner behaves as though
1352.Fn yywrap
1353returned 1), or you must link with
1354.Fl lfl
1355to obtain the default version of the routine, which always returns 1.
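.Pp
A user-supplied
.Fn yywrap
that moves on to the next input file might be sketched as follows.
This is an illustration only:
.Va next_file
is a hypothetical variable, and
.Va yyin
is defined here just so that the fragment compiles on its own;
in a real scanner it is provided by the generated file.
.Bd -literal -offset indent
```c
#include <stdio.h>

FILE *yyin;             /* stand-in: normally defined by lex.yy.c */
char **next_file;       /* hypothetical: NULL-terminated list of names */

int
yywrap(void)
{
        /* Done with the file that just hit end-of-file. */
        if (yyin != NULL && yyin != stdin)
                fclose(yyin);
        yyin = NULL;

        /* Try the remaining names, skipping any that cannot be opened. */
        while (next_file != NULL && *next_file != NULL) {
                yyin = fopen(*next_file++, "r");
                if (yyin != NULL)
                        return 0;       /* yyin is set up; keep scanning */
        }
        return 1;                       /* nothing left; yylex() returns 0 */
}
```
.Ed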
1356.Pp
1357Three routines are available for scanning from in-memory buffers rather
1358than files:
1359.Fn yy_scan_string ,
1360.Fn yy_scan_bytes ,
1361and
1362.Fn yy_scan_buffer .
1363See the discussion of them below in the section
1364.Sx MULTIPLE INPUT BUFFERS .
1365.Pp
1366The scanner writes its
1367.Em ECHO
1368output to the
1369.Em yyout
1370global
1371.Pq default, stdout ,
1372which may be redefined by the user simply by assigning it to some other
1373.Va FILE
1374pointer.
1375.Sh START CONDITIONS
1376.Nm
1377provides a mechanism for conditionally activating rules.
1378Any rule whose pattern is prefixed with
1379.Qq Aq sc
1380will only be active when the scanner is in the start condition named
1381.Qq sc .
1382For example,
1383.Bd -literal -offset indent
1384<STRING>[^"]* { /* eat up the string body ... */
1385        ...
1386}
1387.Ed
1388.Pp
1389will be active only when the scanner is in the
1390.Qq STRING
1391start condition, and
1392.Bd -literal -offset indent
1393<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1394        ...
1395}
1396.Ed
1397.Pp
1398will be active only when the current start condition is either
1399.Qq INITIAL ,
1400.Qq STRING ,
1401or
1402.Qq QUOTE .
1403.Pp
1404Start conditions are declared in the definitions
1405.Pq first
1406section of the input using unindented lines beginning with either
1407.Sq %s
1408or
1409.Sq %x
1410followed by a list of names.
1411The former declares
1412.Em inclusive
1413start conditions, the latter
1414.Em exclusive
1415start conditions.
1416A start condition is activated using the
1417.Em BEGIN
1418action.
1419Until the next
1420.Em BEGIN
1421action is executed, rules with the given start condition will be active and
1422rules with other start conditions will be inactive.
1423If the start condition is inclusive,
1424then rules with no start conditions at all will also be active.
1425If it is exclusive,
1426then only rules qualified with the start condition will be active.
1427A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
1429.Nm
1430input.
1431Because of this, exclusive start conditions make it easy to specify
1432.Qq mini-scanners
1433which scan portions of the input that are syntactically different
1434from the rest
1435.Pq e.g., comments .
1436.Pp
1437If the distinction between inclusive and exclusive start conditions
1438is still a little vague, here's a simple example illustrating the
1439connection between the two.
1440The set of rules:
1441.Bd -literal -offset indent
1442%s example
1443%%
1444
1445<example>foo   do_something();
1446
1447bar            something_else();
1448.Ed
1449.Pp
1450is equivalent to
1451.Bd -literal -offset indent
1452%x example
1453%%
1454
1455<example>foo   do_something();
1456
1457<INITIAL,example>bar    something_else();
1458.Ed
1459.Pp
1460Without the
1461.Aq INITIAL,example
1462qualifier, the
1463.Dq bar
1464pattern in the second example wouldn't be active
1465.Pq i.e., couldn't match
1466when in start condition
1467.Dq example .
1468If we just used
1469.Aq example
1470to qualify
1471.Dq bar ,
1472though, then it would only be active in
1473.Dq example
1474and not in
1475.Em INITIAL ,
1476while in the first example it's active in both,
1477because in the first example the
1478.Dq example
1479start condition is an inclusive
1480.Pq Sq %s
1481start condition.
1482.Pp
1483Also note that the special start-condition specifier
1484.Sq Aq *
1485matches every start condition.
1486Thus, the above example could also have been written:
1487.Bd -literal -offset indent
1488%x example
1489%%
1490
1491<example>foo   do_something();
1492
1493<*>bar         something_else();
1494.Ed
1495.Pp
1496The default rule (to
1497.Em ECHO
1498any unmatched character) remains active in start conditions.
1499It is equivalent to:
1500.Bd -literal -offset indent
1501<*>.|\en     ECHO;
1502.Ed
1503.Pp
1504.Dq BEGIN(0)
1505returns to the original state where only the rules with
1506no start conditions are active.
1507This state can also be referred to as the start-condition
1508.Em INITIAL ,
1509so
1510.Dq BEGIN(INITIAL)
1511is equivalent to
1512.Dq BEGIN(0) .
1513(The parentheses around the start condition name are not required but
1514are considered good style.)
1515.Pp
1516.Em BEGIN
1517actions can also be given as indented code at the beginning
1518of the rules section.
1519For example, the following will cause the scanner to enter the
1520.Qq SPECIAL
1521start condition whenever
1522.Fn yylex
1523is called and the global variable
1524.Fa enter_special
1525is true:
1526.Bd -literal -offset indent
        int enter_special;
1528
1529%x SPECIAL
1530%%
1531        if (enter_special)
1532                BEGIN(SPECIAL);
1533
1534<SPECIAL>blahblahblah
1535\&...more rules follow...
1536.Ed
1537.Pp
1538To illustrate the uses of start conditions,
1539here is a scanner which provides two different interpretations
1540of a string like
1541.Qq 123.456 .
1542By default it will treat it as three tokens: the integer
1543.Qq 123 ,
1544a dot
1545.Pq Sq .\& ,
1546and the integer
1547.Qq 456 .
1548But if the string is preceded earlier in the line by the string
1549.Qq expect-floats
1550it will treat it as a single token, the floating-point number 123.456:
1551.Bd -literal -offset indent
1552%{
1553#include <math.h>
1554%}
1555%s expect
1556
1557%%
1558expect-floats        BEGIN(expect);
1559
1560<expect>[0-9]+"."[0-9]+ {
1561        printf("found a float, = %s\en", yytext);
1562}
1563<expect>\en {
1564        /*
1565         * That's the end of the line, so
         * we need another "expect-floats"
1567         * before we'll recognize any more
1568         * numbers.
1569         */
1570        BEGIN(INITIAL);
1571}
1572
1573[0-9]+ {
1574        printf("found an integer, = %s\en", yytext);
1575}
1576
1577"."     printf("found a dot\en");
1578.Ed
1579.Pp
1580Here is a scanner which recognizes
1581.Pq and discards
1582C comments while maintaining a count of the current input line:
1583.Bd -literal -offset indent
1584%x comment
1585%%
        int line_num = 1;
1587
1588"/*"                    BEGIN(comment);
1589
1590<comment>[^*\en]*        /* eat anything that's not a '*' */
1591<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
1592<comment>\en             ++line_num;
1593<comment>"*"+"/"        BEGIN(INITIAL);
1594.Ed
1595.Pp
1596This scanner goes to a bit of trouble to match as much
1597text as possible with each rule.
1598In general, when attempting to write a high-speed scanner
1599try to match as much as possible in each rule, as it's a big win.
1600.Pp
1601Note that start-condition names are really integer values and
1602can be stored as such.
1603Thus, the above could be extended in the following fashion:
1604.Bd -literal -offset indent
1605%x comment foo
1606%%
        int line_num = 1;
        int comment_caller;
1609
1610"/*" {
1611        comment_caller = INITIAL;
1612        BEGIN(comment);
1613}
1614
1615\&...
1616
1617<foo>"/*" {
1618        comment_caller = foo;
1619        BEGIN(comment);
1620}
1621
1622<comment>[^*\en]*        /* eat anything that's not a '*' */
1623<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
1624<comment>\en             ++line_num;
1625<comment>"*"+"/"        BEGIN(comment_caller);
1626.Ed
1627.Pp
1628Furthermore, the current start condition can be accessed by using
1629the integer-valued
1630.Dv YY_START
1631macro.
1632For example, the above assignments to
1633.Em comment_caller
1634could instead be written
1635.Pp
1636.Dl comment_caller = YY_START;
1637.Pp
1638Flex provides
1639.Dv YYSTATE
1640as an alias for
1641.Dv YY_START
1642(since that is what's used by
1643.At
1644.Nm lex ) .
1645.Pp
1646Note that start conditions do not have their own name-space;
1647%s's and %x's declare names in the same fashion as #define's.
1648.Pp
1649Finally, here's an example of how to match C-style quoted strings using
1650exclusive start conditions, including expanded escape sequences
1651(but not including checking for a string that's too long):
1652.Bd -literal -offset indent
1653%x str
1654
1655%%
        #define MAX_STR_CONST 1024
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;
1659
1660\e"      string_buf_ptr = string_buf; BEGIN(str);
1661
1662<str>\e" { /* saw closing quote - all done */
1663        BEGIN(INITIAL);
1664        *string_buf_ptr = '\e0';
1665        /*
1666         * return string constant token type and
1667         * value to parser
1668         */
1669}
1670
1671<str>\en {
1672        /* error - unterminated string constant */
1673        /* generate error message */
1674}
1675
<str>\e\e[0-7]{1,3} {
        /* octal escape sequence */
        unsigned int result;

        (void) sscanf(yytext + 1, "%o", &result);

        if (result > 0xff) {
                /* error, constant is out-of-bounds */
        } else
                *string_buf_ptr++ = result;
}
1687
1688<str>\e\e[0-9]+ {
1689        /*
1690         * generate error - bad escape sequence; something
1691         * like '\e48' or '\e0777777'
1692         */
1693}
1694
1695<str>\e\en  *string_buf_ptr++ = '\en';
1696<str>\e\et  *string_buf_ptr++ = '\et';
1697<str>\e\er  *string_buf_ptr++ = '\er';
1698<str>\e\eb  *string_buf_ptr++ = '\eb';
1699<str>\e\ef  *string_buf_ptr++ = '\ef';
1700
1701<str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];
1702
1703<str>[^\e\e\en\e"]+ {
1704        char *yptr = yytext;
1705
1706        while (*yptr)
1707                *string_buf_ptr++ = *yptr++;
1708}
1709.Ed
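.Pp
The range check in the octal-escape action above can also be written with
.Xr strtoul 3 ,
which avoids scanf-style format matching.
The following is a minimal sketch;
.Fn octal_escape_value
is a hypothetical helper name, and its argument is assumed to point
at the backslash, as
.Fa yytext
does in that action.
.Bd -literal -offset indent
```c
#include <stdlib.h>

/*
 * Return the value of an octal escape, or -1 if it does not fit in
 * eight bits.  text points at the backslash; the digits follow it.
 */
static int
octal_escape_value(const char *text)
{
        unsigned long v = strtoul(text + 1, NULL, 8);

        return v > 0xff ? -1 : (int)v;
}
```
.Ed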
1710.Pp
1711Often, such as in some of the examples above,
1712a whole bunch of rules are all preceded by the same start condition(s).
1713.Nm
1714makes this a little easier and cleaner by introducing a notion of
1715start condition
1716.Em scope .
1717A start condition scope is begun with:
1718.Pp
1719.Dl <SCs>{
1720.Pp
1721where
1722.Dq SCs
1723is a list of one or more start conditions.
1724Inside the start condition scope, every rule automatically has the prefix
1725.Aq SCs
1726applied to it, until a
1727.Sq }
1728which matches the initial
1729.Sq { .
1730So, for example,
1731.Bd -literal -offset indent
1732<ESC>{
1733    "\e\en"   return '\en';
1734    "\e\er"   return '\er';
1735    "\e\ef"   return '\ef';
1736    "\e\e0"   return '\e0';
1737}
1738.Ed
1739.Pp
1740is equivalent to:
1741.Bd -literal -offset indent
1742<ESC>"\e\en"  return '\en';
1743<ESC>"\e\er"  return '\er';
1744<ESC>"\e\ef"  return '\ef';
1745<ESC>"\e\e0"  return '\e0';
1746.Ed
1747.Pp
1748Start condition scopes may be nested.
1749.Pp
1750Three routines are available for manipulating stacks of start conditions:
1751.Bl -tag -width Ds
1752.It void yy_push_state(int new_state)
1753Pushes the current start condition onto the top of the start condition
1754stack and switches to
1755.Fa new_state
1756as though
1757.Dq BEGIN new_state
1758had been used
1759.Pq recall that start condition names are also integers .
1760.It void yy_pop_state()
1761Pops the top of the stack and switches to it via
1762.Em BEGIN .
1763.It int yy_top_state()
1764Returns the top of the stack without altering the stack's contents.
1765.El
1766.Pp
1767The start condition stack grows dynamically and so has no built-in
1768size limitation.
1769If memory is exhausted, program execution aborts.
1770.Pp
1771To use start condition stacks, scanners must include a
1772.Dq %option stack
1773directive (see
1774.Sx OPTIONS
1775below).
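.Pp
The behaviour of the three stack routines can be modelled in plain C
as shown below.
This is an illustrative model, not the code that
.Nm
generates: the
.Li sc_
names are hypothetical, and
.Va cur_state
stands in for the scanner's current start condition.
.Bd -literal -offset indent
```c
#include <stdlib.h>

static int cur_state;           /* stand-in for the start condition */
static int *sc_stack;           /* saved start conditions */
static size_t sc_depth, sc_cap;

/* Save the current start condition, then switch to new_state. */
static void
sc_push_state(int new_state)
{
        if (sc_depth == sc_cap) {
                size_t ncap = sc_cap ? 2 * sc_cap : 8;
                int *p = realloc(sc_stack, ncap * sizeof(*p));

                if (p == NULL)
                        abort();        /* memory exhausted */
                sc_stack = p;
                sc_cap = ncap;
        }
        sc_stack[sc_depth++] = cur_state;
        cur_state = new_state;
}

/* Remove the most recently saved start condition and switch to it. */
static void
sc_pop_state(void)
{
        if (sc_depth == 0)
                abort();                /* nothing to pop */
        cur_state = sc_stack[--sc_depth];
}

/* Report the top of the stack; the caller ensures it is not empty. */
static int
sc_top_state(void)
{
        return sc_stack[sc_depth - 1];
}
```
.Ed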
1776.Sh MULTIPLE INPUT BUFFERS
1777Some scanners
1778(such as those which support
1779.Qq include
1780files)
1781require reading from several input streams.
1782As
1783.Nm
1784scanners do a large amount of buffering, one cannot control
1785where the next input will be read from by simply writing a
1786.Dv YY_INPUT
1787which is sensitive to the scanning context.
1788.Dv YY_INPUT
1789is only called when the scanner reaches the end of its buffer, which
1790may be a long time after scanning a statement such as an
1791.Qq include
1792which requires switching the input source.
1793.Pp
1794To negotiate these sorts of problems,
1795.Nm
1796provides a mechanism for creating and switching between multiple
1797input buffers.
1798An input buffer is created by using:
1799.Pp
1800.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1801.Pp
1802which takes a
1803.Fa FILE
1804pointer and a
1805.Fa size
1806and creates a buffer associated with the given file and large enough to hold
1807.Fa size
1808characters (when in doubt, use
1809.Dv YY_BUF_SIZE
1810for the size).
1811It returns a
1812.Dv YY_BUFFER_STATE
1813handle, which may then be passed to other routines
1814.Pq see below .
1815The
1816.Dv YY_BUFFER_STATE
1817type is a pointer to an opaque
1818.Dq struct yy_buffer_state
1819structure, so
1820.Dv YY_BUFFER_STATE
1821variables may be safely initialized to
1822.Dq ((YY_BUFFER_STATE) 0)
1823if desired, and the opaque structure can also be referred to in order to
1824correctly declare input buffers in source files other than that of scanners.
1825Note that the
1826.Fa FILE
1827pointer in the call to
1828.Fn yy_create_buffer
1829is only used as the value of
1830.Fa yyin
1831seen by
1832.Dv YY_INPUT ;
1833if
1834.Dv YY_INPUT
1835is redefined so that it no longer uses
1836.Fa yyin ,
1837then a nil
1838.Fa FILE
1839pointer can safely be passed to
1840.Fn yy_create_buffer .
1841To select a particular buffer to scan:
1842.Pp
1843.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1844.Pp
1845It switches the scanner's input buffer so subsequent tokens will
1846come from
1847.Fa new_buffer .
1848Note that
1849.Fn yy_switch_to_buffer
1850may be used by
1851.Fn yywrap
1852to set things up for continued scanning,
1853instead of opening a new file and pointing
1854.Fa yyin
1855at it.
1856Note also that switching input sources via either
1857.Fn yy_switch_to_buffer
1858or
1859.Fn yywrap
1860does not change the start condition.
1861.Pp
1862.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1863.Pp
1864is used to reclaim the storage associated with a buffer.
1865.Pf ( Fa buffer
1866can be nil, in which case the routine does nothing.)
1867To clear the current contents of a buffer:
1868.Pp
1869.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1870.Pp
1871This function discards the buffer's contents,
1872so the next time the scanner attempts to match a token from the buffer,
1873it will first fill the buffer anew using
1874.Dv YY_INPUT .
1875.Pp
1876.Fn yy_new_buffer
1877is an alias for
1878.Fn yy_create_buffer ,
1879provided for compatibility with the C++ use of
1880.Em new
1881and
1882.Em delete
1883for creating and destroying dynamic objects.
1884.Pp
1885Finally, the
1886.Dv YY_CURRENT_BUFFER
1887macro returns a
1888.Dv YY_BUFFER_STATE
1889handle to the current buffer.
1890.Pp
1891Here is an example of using these features for writing a scanner
1892which expands include files (the
1893.Aq Aq EOF
1894feature is discussed below):
1895.Bd -literal -offset indent
1896/*
1897 * the "incl" state is used for picking up the name
1898 * of an include file
1899 */
1900%x incl
1901
1902%{
1903#define MAX_INCLUDE_DEPTH 10
1904YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1905int include_stack_ptr = 0;
1906%}
1907
1908%%
1909include             BEGIN(incl);
1910
1911[a-z]+              ECHO;
1912[^a-z\en]*\en?        ECHO;
1913
1914<incl>[ \et]*        /* eat the whitespace */
1915<incl>[^ \et\en]+ {   /* got the include file name */
1916        if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1917                errx(1, "Includes nested too deeply");
1918
1919        include_stack[include_stack_ptr++] =
1920            YY_CURRENT_BUFFER;
1921
1922        yyin = fopen(yytext, "r");
1923
1924        if (yyin == NULL)
1925                err(1, NULL);
1926
1927        yy_switch_to_buffer(
1928            yy_create_buffer(yyin, YY_BUF_SIZE));
1929
1930        BEGIN(INITIAL);
1931}
1932
1933<<EOF>> {
1934        if (--include_stack_ptr < 0)
1935                yyterminate();
1936        else {
1937                yy_delete_buffer(YY_CURRENT_BUFFER);
1938                yy_switch_to_buffer(
1939                    include_stack[include_stack_ptr]);
        }
1941}
1942.Ed
1943.Pp
1944Three routines are available for setting up input buffers for
1945scanning in-memory strings instead of files.
1946All of them create a new input buffer for scanning the string,
1947and return a corresponding
1948.Dv YY_BUFFER_STATE
1949handle (which should be deleted afterwards using
1950.Fn yy_delete_buffer ) .
1951They also switch to the new buffer using
1952.Fn yy_switch_to_buffer ,
1953so the next call to
1954.Fn yylex
1955will start scanning the string.
1956.Bl -tag -width Ds
1957.It yy_scan_string(const char *str)
1958Scans a NUL-terminated string.
1959.It yy_scan_bytes(const char *bytes, int len)
1960Scans
1961.Fa len
1962bytes
1963.Pq including possibly NUL's
1964starting at location
1965.Fa bytes .
1966.El
1967.Pp
1968Note that both of these functions create and scan a copy
1969of the string or bytes.
1970(This may be desirable, since
1971.Fn yylex
1972modifies the contents of the buffer it is scanning.)
1973The copy can be avoided by using:
1974.Bl -tag -width Ds
.It yy_scan_buffer(char *base, yy_size_t size)
Scans in place the buffer starting at
.Fa base ,
consisting of
.Fa size
bytes, the last two bytes of which must be
.Dv YY_END_OF_BUFFER_CHAR
.Pq ASCII NUL .
These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-3], inclusive.
.Pp
If
.Fa base
is not set up in this manner
(i.e., the final two
.Dv YY_END_OF_BUFFER_CHAR
bytes are missing), then
.Fn yy_scan_buffer
returns a nil pointer instead of creating a new input buffer.
.Pp
The type
.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
1999.El
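.Pp
Preparing a buffer in the layout that
.Fn yy_scan_buffer
expects can be sketched as follows.
.Fn make_scan_buffer
is a hypothetical helper, not part of
.Nm ,
and it assumes that
.Dv YY_END_OF_BUFFER_CHAR
is the NUL byte, as stated above.
.Bd -literal -offset indent
```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy text into a freshly allocated buffer that ends in two NUL
 * bytes.  On success the total size, terminators included, is stored
 * in *sizep; on allocation failure NULL is returned.
 */
static char *
make_scan_buffer(const char *text, size_t *sizep)
{
        size_t len = strlen(text);
        char *base = malloc(len + 2);

        if (base == NULL)
                return NULL;
        memcpy(base, text, len);
        base[len] = 0;          /* first end-of-buffer byte */
        base[len + 1] = 0;      /* second end-of-buffer byte */
        *sizep = len + 2;
        return base;
}
```
.Ed
.Pp
The result would then be passed on as
.Fn yy_scan_buffer base size .
Since the scanner works on the buffer in place,
the memory should stay allocated for as long as the buffer is in use.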
2000.Sh END-OF-FILE RULES
2001The special rule
2002.Qq Aq Aq EOF
2003indicates actions which are to be taken when an end-of-file is encountered and
2004.Fn yywrap
2005returns non-zero
2006.Pq i.e., indicates no further files to process .
2007The action must finish by doing one of four things:
2008.Bl -dash
2009.It
2010Assigning
2011.Em yyin
2012to a new input file
2013(in previous versions of
2014.Nm ,
2015after doing the assignment, it was necessary to call the special action
2016.Dv YY_NEW_FILE ;
2017this is no longer necessary).
2018.It
2019Executing a
2020.Em return
2021statement.
2022.It
2023Executing the special
2024.Fn yyterminate
2025action.
2026.It
2027Switching to a new buffer using
2028.Fn yy_switch_to_buffer
2029as shown in the example above.
2030.El
2031.Pp
2032.Aq Aq EOF
2033rules may not be used with other patterns;
2034they may only be qualified with a list of start conditions.
2035If an unqualified
2036.Aq Aq EOF
2037rule is given, it applies to all start conditions which do not already have
2038.Aq Aq EOF
2039actions.
2040To specify an
2041.Aq Aq EOF
2042rule for only the initial start condition, use
2043.Pp
2044.Dl <INITIAL><<EOF>>
2045.Pp
2046These rules are useful for catching things like unclosed comments.
2047An example:
2048.Bd -literal -offset indent
2049%x quote
2050%%
2051
2052\&...other rules for dealing with quotes...
2053
2054<quote><<EOF>> {
2055         error("unterminated quote");
2056         yyterminate();
2057}
2058<<EOF>> {
2059         if (*++filelist)
2060                 yyin = fopen(*filelist, "r");
2061         else
2062                 yyterminate();
2063}
2064.Ed
2065.Sh MISCELLANEOUS MACROS
2066The macro
2067.Dv YY_USER_ACTION
2068can be defined to provide an action
2069which is always executed prior to the matched rule's action.
2070For example,
2071it could be #define'd to call a routine to convert yytext to lower-case.
2072When
2073.Dv YY_USER_ACTION
2074is invoked, the variable
2075.Fa yy_act
2076gives the number of the matched rule
2077.Pq rules are numbered starting with 1 .
2078For example, to profile how often each rule is matched,
2079the following would do the trick:
2080.Pp
2081.Dl #define YY_USER_ACTION ++ctr[yy_act]
2082.Pp
2083where
2084.Fa ctr
2085is an array to hold the counts for the different rules.
2086Note that the macro
2087.Dv YY_NUM_RULES
2088gives the total number of rules
2089(including the default rule, even if
2090.Fl s
2091is used),
so, since rule numbers start at 1, a safe declaration for
.Fa ctr
is:
.Pp
.Dl int ctr[YY_NUM_RULES + 1];
2097.Pp
2098The macro
2099.Dv YY_USER_INIT
2100may be defined to provide an action which is always executed before
2101the first scan
2102.Pq and before the scanner's internal initializations are done .
2103For example, it could be used to call a routine to read
2104in a data table or open a logging file.
2105.Pp
2106The macro
2107.Dv yy_set_interactive(is_interactive)
2108can be used to control whether the current buffer is considered
2109.Em interactive .
2110An interactive buffer is processed more slowly,
2111but must be used when the scanner's input source is indeed
2112interactive to avoid problems due to waiting to fill buffers
2113(see the discussion of the
2114.Fl I
2115flag below).
2116A non-zero value in the macro invocation marks the buffer as interactive,
2117a zero value as non-interactive.
2118Note that use of this macro overrides
2119.Dq %option always-interactive
2120or
2121.Dq %option never-interactive
2122(see
2123.Sx OPTIONS
2124below).
2125.Fn yy_set_interactive
2126must be invoked prior to beginning to scan the buffer that is
2127.Pq or is not
2128to be considered interactive.
2129.Pp
2130The macro
2131.Dv yy_set_bol(at_bol)
2132can be used to control whether the current buffer's scanning
2133context for the next token match is done as though at the
2134beginning of a line.
2135A non-zero macro argument makes rules anchored with
2136.Sq ^
2137active, while a zero argument makes
2138.Sq ^
2139rules inactive.
2140.Pp
2141The macro
2142.Dv YY_AT_BOL
2143returns true if the next token scanned from the current buffer will have
2144.Sq ^
2145rules active, false otherwise.
2146.Pp
2147In the generated scanner, the actions are all gathered in one large
2148switch statement and separated using
2149.Dv YY_BREAK ,
2150which may be redefined.
2151By default, it is simply a
2152.Qq break ,
2153to separate each rule's action from the following rules.
2154Redefining
2155.Dv YY_BREAK
2156allows, for example, C++ users to
2157.Dq #define YY_BREAK
2158to do nothing
2159(while being very careful that every rule ends with a
2160.Qq break
2161or a
2162.Qq return ! )
to avoid unreachable-statement warnings in cases where,
because a rule's action ends with
.Dq return ,
the
.Dv YY_BREAK
can never be reached.
2169.Sh VALUES AVAILABLE TO THE USER
2170This section summarizes the various values available to the user
2171in the rule actions.
2172.Bl -tag -width Ds
2173.It char *yytext
2174Holds the text of the current token.
2175It may be modified but not lengthened
2176.Pq characters cannot be appended to the end .
2177.Pp
2178If the special directive
2179.Dq %array
2180appears in the first section of the scanner description, then
2181.Fa yytext
2182is instead declared
2183.Dq char yytext[YYLMAX] ,
2184where
2185.Dv YYLMAX
2186is a macro definition that can be redefined in the first section
2187to change the default value
2188.Pq generally 8KB .
2189Using
2190.Dq %array
2191results in somewhat slower scanners, but the value of
2192.Fa yytext
2193becomes immune to calls to
2194.Fn input
2195and
2196.Fn unput ,
2197which potentially destroy its value when
2198.Fa yytext
2199is a character pointer.
2200The opposite of
2201.Dq %array
2202is
2203.Dq %pointer ,
2204which is the default.
2205.Pp
2206.Dq %array
2207cannot be used when generating C++ scanner classes
2208(the
2209.Fl +
2210flag).
2211.It int yyleng
2212Holds the length of the current token.
2213.It FILE *yyin
2214Is the file which by default
2215.Nm
2216reads from.
It may be reassigned, but doing so only makes sense before
2218scanning begins or after an
2219.Dv EOF
2220has been encountered.
2221Changing it in the midst of scanning will have unexpected results since
2222.Nm
2223buffers its input; use
2224.Fn yyrestart
2225instead.
2226Once scanning terminates because an end-of-file
2227has been seen,
2228.Fa yyin
2229can be assigned as the new input file
2230and the scanner can be called again to continue scanning.
2231.It void yyrestart(FILE *new_file)
2232May be called to point
2233.Fa yyin
2234at the new input file.
2235The switch-over to the new file is immediate
2236.Pq any previously buffered-up input is lost .
2237Note that calling
2238.Fn yyrestart
2239with
2240.Fa yyin
2241as an argument thus throws away the current input buffer and continues
2242scanning the same input file.
2243.It FILE *yyout
2244Is the file to which
2245.Em ECHO
2246actions are done.
2247It can be reassigned by the user.
2248.It YY_CURRENT_BUFFER
2249Returns a
2250.Dv YY_BUFFER_STATE
2251handle to the current buffer.
2252.It YY_START
2253Returns an integer value corresponding to the current start condition.
2254This value can subsequently be used with
2255.Em BEGIN
2256to return to that start condition.
2257.El
2258.Sh INTERFACING WITH YACC
2259One of the main uses of
2260.Nm
2261is as a companion to the
2262.Xr yacc 1
2263parser-generator.
Parsers generated by yacc expect to call a routine named
2265.Fn yylex
2266to find the next input token.
2267The routine is supposed to return the type of the next token
2268as well as putting any associated value in the global
2269.Fa yylval ,
2270which is defined externally,
2271and can be a union or any other complex data structure.
2272To use
2273.Nm
2274with yacc, one specifies the
2275.Fl d
2276option to yacc to instruct it to generate the file
2277.Pa y.tab.h
2278containing definitions of all the
2279.Dq %tokens
2280appearing in the yacc input.
2281This file is then included in the
2282.Nm
2283scanner.
2284For example, part of the scanner might look like:
2285.Bd -literal -offset indent
2286%{
2287#include "y.tab.h"
2288%}
2289
2290%%
2291
2292if            return TOK_IF;
2293then          return TOK_THEN;
2294begin         return TOK_BEGIN;
2295end           return TOK_END;
2296.Ed
2297.Sh OPTIONS
2298.Nm
2299has the following options:
2300.Bl -tag -width Ds
2301.It Fl 7
2302Instructs
2303.Nm
2304to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2305characters in its input.
2306The advantage of using
2307.Fl 7
2308is that the scanner's tables can be up to half the size of those generated
2309using the
2310.Fl 8
2311option
2312.Pq see below .
2313The disadvantage is that such scanners often hang
2314or crash if their input contains an 8-bit character.
2315.Pp
2316Note, however, that unless generating a scanner using the
2317.Fl Cf
2318or
2319.Fl CF
2320table compression options, use of
2321.Fl 7
2322will save only a small amount of table space,
2323and make the scanner considerably less portable.
2324.Nm flex Ns 's
2325default behavior is to generate an 8-bit scanner unless
2326.Fl Cf
2327or
2328.Fl CF
2329is specified, in which case
2330.Nm
2331defaults to generating 7-bit scanners unless it was
2332configured to generate 8-bit scanners
2333(as will often be the case with non-USA sites).
It is possible to tell whether
2335.Nm
2336generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2337.Fl v
2338output as described below.
2339.Pp
2340Note that if
2341.Fl Cfe
2342or
2343.Fl CFe
is used
2345(the table compression options, but also using equivalence classes as
2346discussed below),
2347.Nm
2348still defaults to generating an 8-bit scanner,
2349since usually with these compression options full 8-bit tables
2350are not much more expensive than 7-bit tables.
2351.It Fl 8
2352Instructs
2353.Nm
2354to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2355characters.
2356This flag is only needed for scanners generated using
2357.Fl Cf
2358or
2359.Fl CF ,
2360as otherwise
2361.Nm
2362defaults to generating an 8-bit scanner anyway.
2363.Pp
2364See the discussion of
2365.Fl 7
2366above for
2367.Nm flex Ns 's
2368default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2369.It Fl B
2370Instructs
2371.Nm
2372to generate a
2373.Em batch
2374scanner, the opposite of
2375.Em interactive
2376scanners generated by
2377.Fl I
2378.Pq see below .
2379In general,
2380.Fl B
2381is used when the scanner will never be used interactively,
2382and you want to squeeze a little more performance out of it.
2383If the aim is instead to squeeze out a lot more performance,
2384use the
2385.Fl Cf
2386or
2387.Fl CF
2388options
2389.Pq discussed below ,
2390which turn on
2391.Fl B
2392automatically anyway.
2393.It Fl b
2394Generate backing-up information to
2395.Pa lex.backup .
2396This is a list of scanner states which require backing up
2397and the input characters on which they do so.
2398By adding rules one can remove backing-up states.
2399If all backing-up states are eliminated and
2400.Fl Cf
2401or
2402.Fl CF
2403is used, the generated scanner will run faster (see the
2404.Fl p
2405flag).
2406Only users who wish to squeeze every last cycle out of their
2407scanners need worry about this option.
2408(See the section on
2409.Sx PERFORMANCE CONSIDERATIONS
2410below.)
2411.It Fl C Ns Op Cm aeFfmr
2412Controls the degree of table compression and, more generally, trade-offs
2413between small scanners and fast scanners.
2414.Bl -tag -width Ds
2415.It Fl Ca
2416Instructs
2417.Nm
2418to trade off larger tables in the generated scanner for faster performance
2419because the elements of the tables are better aligned for memory access
2420and computation.
2421On some
2422.Tn RISC
2423architectures, fetching and manipulating longwords is more efficient
2424than with smaller-sized units such as shortwords.
2425This option can double the size of the tables used by the scanner.
2426.It Fl Ce
2427Directs
2428.Nm
2429to construct
2430.Em equivalence classes ,
2431i.e., sets of characters which have identical lexical properties
2432(for example, if the only appearance of digits in the
2433.Nm
2434input is in the character class
2435.Qq [0-9]
2436then the digits
2437.Sq 0 ,
2438.Sq 1 ,
2439.Sq ... ,
2440.Sq 9
2441will all be put in the same equivalence class).
2442Equivalence classes usually give dramatic reductions in the final
2443table/object file sizes
2444.Pq typically a factor of 2\-5
2445and are pretty cheap performance-wise
2446.Pq one array look-up per character scanned .
2447.It Fl CF
2448Specifies that the alternate fast scanner representation
2449(described below under the
2450.Fl F
2451option)
2452should be used.
2453This option cannot be used with
2454.Fl + .
2455.It Fl Cf
2456Specifies that the
2457.Em full
2458scanner tables should be generated \-
2459.Nm
2460should not compress the tables by taking advantage of
2461similar transition functions for different states.
2462.It Fl \&Cm
2463Directs
2464.Nm
2465to construct
2466.Em meta-equivalence classes ,
2467which are sets of equivalence classes
2468(or characters, if equivalence classes are not being used)
2469that are commonly used together.
2470Meta-equivalence classes are often a big win when using compressed tables,
2471but they have a moderate performance impact
2472(one or two
2473.Qq if
2474tests and one array look-up per character scanned).
2475.It Fl Cr
2476Causes the generated scanner to
2477.Em bypass
2478use of the standard I/O library
2479.Pq stdio
2480for input.
2481Instead of calling
2482.Xr fread 3
2483or
2484.Xr getc 3 ,
2485the scanner will use the
2486.Xr read 2
2487system call,
2488resulting in a performance gain which varies from system to system,
2489but in general is probably negligible unless
2490.Fl Cf
2491or
2492.Fl CF
is being used.
Using
.Fl Cr
can cause strange behavior if, for example,
.Fa yyin
is read using stdio prior to calling the scanner
(because the scanner will miss whatever text the previous reads left
in the stdio input buffer).
2501.Pp
2502.Fl Cr
2503has no effect if
2504.Dv YY_INPUT
2505is defined
2506(see
2507.Sx THE GENERATED SCANNER
2508above).
2509.El
2510.Pp
2511A lone
2512.Fl C
2513specifies that the scanner tables should be compressed but neither
2514equivalence classes nor meta-equivalence classes should be used.
2515.Pp
2516The options
2517.Fl Cf
2518or
2519.Fl CF
2520and
2521.Fl \&Cm
2522do not make sense together \- there is no opportunity for meta-equivalence
2523classes if the table is not being compressed.
2524Otherwise the options may be freely mixed, and are cumulative.
2525.Pp
2526The default setting is
2527.Fl Cem
2528which specifies that
2529.Nm
2530should generate equivalence classes and meta-equivalence classes.
2531This setting provides the highest degree of table compression.
Faster-executing scanners can be had at the cost of
larger tables, with the following generally being true:
2534.Bd -unfilled -offset indent
2535slowest & smallest
2536      -Cem
2537      -Cm
2538      -Ce
2539      -C
2540      -C{f,F}e
2541      -C{f,F}
2542      -C{f,F}a
2543fastest & largest
2544.Ed
2545.Pp
Note that scanners with the smallest tables are usually generated and
compiled the quickest,
so during development the default, maximal compression,
is usually best.
2550.Pp
2551.Fl Cfe
2552is often a good compromise between speed and size for production scanners.
2553.It Fl d
2554Makes the generated scanner run in debug mode.
2555Whenever a pattern is recognized and the global
2556.Fa yy_flex_debug
2557is non-zero
2558.Pq which is the default ,
2559the scanner will write to stderr a line of the form:
2560.Pp
2561.D1 --accepting rule at line 53 ("the matched text")
2562.Pp
2563The line number refers to the location of the rule in the file
2564defining the scanner
2565(i.e., the file that was fed to
2566.Nm ) .
2567Messages are also generated when the scanner backs up,
2568accepts the default rule,
2569reaches the end of its input buffer
2570(or encounters a NUL;
2571at this point, the two look the same as far as the scanner's concerned),
2572or reaches an end-of-file.
2573.It Fl F
2574Specifies that the fast scanner table representation should be used
2575.Pq and stdio bypassed .
2576This representation is about as fast as the full table representation
2577.Pq Fl f ,
2578and for some sets of patterns will be considerably smaller
2579.Pq and for others, larger .
2580In general, if the pattern set contains both
2581.Qq keywords
2582and a catch-all,
2583.Qq identifier
2584rule, such as in the set:
2585.Bd -unfilled -offset indent
2586"case"    return TOK_CASE;
2587"switch"  return TOK_SWITCH;
2588\&...
2589"default" return TOK_DEFAULT;
2590[a-z]+    return TOK_ID;
2591.Ed
2592.Pp
2593then it's better to use the full table representation.
2594If only the
2595.Qq identifier
2596rule is present and a hash table or some such is used to detect the keywords,
2597it's better to use
2598.Fl F .
2599.Pp
2600This option is equivalent to
2601.Fl CFr
2602.Pq see above .
2603It cannot be used with
2604.Fl + .
2605.It Fl f
2606Specifies
2607.Em fast scanner .
2608No table compression is done and stdio is bypassed.
2609The result is large but fast.
2610This option is equivalent to
2611.Fl Cfr
2612.Pq see above .
2613.It Fl h
2614Generates a help summary of
2615.Nm flex Ns 's
2616options to stdout and then exits.
2617.Fl ?\&
2618and
2619.Fl Fl help
2620are synonyms for
2621.Fl h .
2622.It Fl I
2623Instructs
2624.Nm
2625to generate an
2626.Em interactive
2627scanner.
2628An interactive scanner is one that only looks ahead to decide
2629what token has been matched if it absolutely must.
2630It turns out that always looking one extra character ahead,
2631even if the scanner has already seen enough text
2632to disambiguate the current token, is a bit faster than
2633only looking ahead when necessary.
2634But scanners that always look ahead give dreadful interactive performance;
2635for example, when a user types a newline,
2636it is not recognized as a newline token until they enter
2637.Em another
2638token, which often means typing in another whole line.
2639.Pp
2640.Nm
2641scanners default to
2642.Em interactive
2643unless
2644.Fl Cf
2645or
2646.Fl CF
2647table-compression options are specified
2648.Pq see above .
That's because if high performance is most important,
2650one of these options should be used,
2651so if they weren't,
2652.Nm
2653assumes it is preferable to trade off a bit of run-time performance for
2654intuitive interactive behavior.
2655Note also that
2656.Fl I
2657cannot be used in conjunction with
2658.Fl Cf
2659or
2660.Fl CF .
2661Thus, this option is not really needed; it is on by default for all those
2662cases in which it is allowed.
2663.Pp
2664A scanner can be forced to not be interactive by using
2665.Fl B
2666.Pq see above .
2667.It Fl i
2668Instructs
2669.Nm
2670to generate a case-insensitive scanner.
2671The case of letters given in the
2672.Nm
2673input patterns will be ignored,
2674and tokens in the input will be matched regardless of case.
2675The matched text given in
2676.Fa yytext
will preserve its original case
2678.Pq i.e., it will not be folded .
2679.It Fl L
2680Instructs
2681.Nm
2682not to generate
2683.Dq #line
2684directives.
2685Without this option,
2686.Nm
2687peppers the generated scanner with #line directives so error messages
2688in the actions will be correctly located with respect to either the original
2689.Nm
2690input file
2691(if the errors are due to code in the input file),
2692or
2693.Pa lex.yy.c
2694(if the errors are
2695.Nm flex Ns 's
2696fault \- these sorts of errors should be reported to the email address
2697given below).
2698.It Fl l
2699Turns on maximum compatibility with the original
2700.At
2701.Nm lex
2702implementation.
2703Note that this does not mean full compatibility.
2704Use of this option costs a considerable amount of performance,
2705and it cannot be used with the
2706.Fl + , f , F , Cf ,
2707or
2708.Fl CF
2709options.
2710For details on the compatibilities it provides, see the section
2711.Sx INCOMPATIBILITIES WITH LEX AND POSIX
2712below.
2713This option also results in the name
2714.Dv YY_FLEX_LEX_COMPAT
2715being #define'd in the generated scanner.
2716.It Fl n
2717Another do-nothing, deprecated option included only for
2718.Tn POSIX
2719compliance.
2720.It Fl o Ns Ar output
2721Directs
2722.Nm
2723to write the scanner to the file
2724.Ar output
2725instead of
2726.Pa lex.yy.c .
2727If
2728.Fl o
2729is combined with the
2730.Fl t
2731option, then the scanner is written to stdout but its
2732.Dq #line
2733directives
2734(see the
2735.Fl L
2736option above)
2737refer to the file
2738.Ar output .
2739.It Fl P Ns Ar prefix
2740Changes the default
2741.Qq yy
2742prefix used by
2743.Nm
2744for all globally visible variable and function names to instead be
2745.Ar prefix .
2746For example,
2747.Fl P Ns Ar foo
2748changes the name of
2749.Fa yytext
2750to
2751.Fa footext .
2752It also changes the name of the default output file from
2753.Pa lex.yy.c
2754to
2755.Pa lex.foo.c .
2756Here are all of the names affected:
2757.Bd -unfilled -offset indent
2758yy_create_buffer
2759yy_delete_buffer
2760yy_flex_debug
2761yy_init_buffer
2762yy_flush_buffer
2763yy_load_buffer_state
2764yy_switch_to_buffer
2765yyin
2766yyleng
2767yylex
2768yylineno
2769yyout
2770yyrestart
2771yytext
2772yywrap
2773.Ed
2774.Pp
2775(If using a C++ scanner, then only
2776.Fa yywrap
2777and
2778.Fa yyFlexLexer
2779are affected.)
2780Within the scanner itself, it is still possible to refer to the global variables
2781and functions using either version of their name; but externally, they
2782have the modified name.
2783.Pp
2784This option allows multiple
2785.Nm
2786programs to be easily linked together into the same executable.
2787Note, though, that using this option also renames
2788.Fn yywrap ,
2789so now either an
2790.Pq appropriately named
2791version of the routine for the scanner must be supplied, or
2792.Dq %option noyywrap
2793must be used, as linking with
2794.Fl lfl
2795no longer provides one by default.
2796.It Fl p
2797Generates a performance report to stderr.
2798The report consists of comments regarding features of the
2799.Nm
2800input file which will cause a serious loss of performance in the resulting
2801scanner.
2802If the flag is specified twice,
2803comments regarding features that lead to minor performance losses
will also be reported.
2805.Pp
2806Note that the use of
2807.Em REJECT ,
2808.Dq %option yylineno ,
2809and variable trailing context
2810(see the
2811.Sx BUGS
2812section below)
2813entails a substantial performance penalty; use of
2814.Fn yymore ,
2815the
2816.Sq ^
2817operator, and the
2818.Fl I
2819flag entail minor performance penalties.
2820.It Fl S Ns Ar skeleton
2821Overrides the default skeleton file from which
2822.Nm
2823constructs its scanners.
2824This option is needed only for
2825.Nm
2826maintenance or development.
2827.It Fl s
2828Causes the default rule
2829.Pq that unmatched scanner input is echoed to stdout
2830to be suppressed.
2831If the scanner encounters input that does not
2832match any of its rules, it aborts with an error.
2833This option is useful for finding holes in a scanner's rule set.
2834.It Fl T
2835Makes
2836.Nm
2837run in
2838.Em trace
2839mode.
2840It will generate a lot of messages to stderr concerning
2841the form of the input and the resultant non-deterministic and deterministic
2842finite automata.
2843This option is mostly for use in maintaining
2844.Nm .
2845.It Fl t
2846Instructs
2847.Nm
2848to write the scanner it generates to standard output instead of
2849.Pa lex.yy.c .
2850.It Fl V
2851Prints the version number to stdout and exits.
2852.Fl Fl version
2853is a synonym for
2854.Fl V .
2855.It Fl v
2856Specifies that
2857.Nm
2858should write to stderr
2859a summary of statistics regarding the scanner it generates.
2860Most of the statistics are meaningless to the casual
2861.Nm
2862user, but the first line identifies the version of
2863.Nm
2864(same as reported by
2865.Fl V ) ,
2866and the next line the flags used when generating the scanner,
2867including those that are on by default.
2868.It Fl w
2869Suppresses warning messages.
2870.It Fl +
2871Specifies that
2872.Nm
2873should generate a C++ scanner class.
2874See the section on
2875.Sx GENERATING C++ SCANNERS
2876below for details.
2877.El
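.Pp
The following invocations sketch how some of the options above are
commonly combined.
The input file name
.Pa scan.l
and the output names are placeholders, not names that
.Nm
itself requires:
.Bd -literal -offset indent
# development: default (-Cem) tables, with the statistics summary
$ flex -v scan.l

# production: full tables plus equivalence classes, written to scan.c
$ flex -Cfe -oscan.c scan.l

# tuning: write lex.backup and ask for the extended performance report
$ flex -b -p -p -Cf scan.l

# two scanners in one program: lex.foo.c (foolex), lex.bar.c (barlex)
$ flex -Pfoo foo.l
$ flex -Pbar bar.l
.Ed
.Pp
As noted under
.Fl P
above, each of the last two specifications would also need
.Dq %option noyywrap
or a suitably renamed
.Fn yywrap
routine.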
2878.Pp
2879.Nm
2880also provides a mechanism for controlling options within the
2881scanner specification itself, rather than from the
2882.Nm
2883command line.
2884This is done by including
2885.Dq %option
2886directives in the first section of the scanner specification.
2887Multiple options can be specified with a single
2888.Dq %option
directive, and multiple directives may appear in the first section of the
2890.Nm
2891input file.
2892.Pp
2893Most options are given simply as names, optionally preceded by the word
2894.Qq no
2895.Pq with no intervening whitespace
2896to negate their meaning.
2897A number are equivalent to
2898.Nm
2899flags or their negation:
2900.Bd -unfilled -offset indent
29017bit            -7 option
29028bit            -8 option
2903align           -Ca option
2904backup          -b option
2905batch           -B option
2906c++             -+ option
2907
2908caseful or
2909case-sensitive  opposite of -i (default)
2910
2911case-insensitive or
2912caseless        -i option
2913
2914debug           -d option
2915default         opposite of -s option
2916ecs             -Ce option
2917fast            -F option
2918full            -f option
2919interactive     -I option
2920lex-compat      -l option
2921meta-ecs        -Cm option
2922perf-report     -p option
2923read            -Cr option
2924stdout          -t option
2925verbose         -v option
2926warn            opposite of -w option
2927                (use "%option nowarn" for -w)
2928
2929array           equivalent to "%array"
2930pointer         equivalent to "%pointer" (default)
2931.Ed
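.Pp
As an illustration, the directives below sketch a specification that
selects several of these options from within the input file rather than
on the command line.
The rules themselves are placeholders:
.Bd -literal -offset indent
%option caseless nodefault
%option 8bit

%%

select      printf("keyword\en");
[ \et\en]+    /* skip whitespace */
\&.           printf("other: %s\en", yytext);
.Ed
.Pp
Here
.Dq nodefault
has the effect of
.Fl s ,
and
.Dq caseless
that of
.Fl i ,
so
.Qq SELECT
and
.Qq select
are treated alike.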
2932.Pp
Some %option directives provide features not otherwise available:
2934.Bl -tag -width Ds
2935.It always-interactive
2936Instructs
2937.Nm
2938to generate a scanner which always considers its input
2939.Qq interactive .
2940Normally, on each new input file the scanner calls
2941.Fn isatty
2942in an attempt to determine whether the scanner's input source is interactive
2943and thus should be read a character at a time.
2944When this option is used, however, no such call is made.
2945.It main
2946Directs
2947.Nm
2948to provide a default
2949.Fn main
2950program for the scanner, which simply calls
2951.Fn yylex .
2952This option implies
2953.Dq noyywrap
2954.Pq see below .
2955.It never-interactive
2956Instructs
2957.Nm
2958to generate a scanner which never considers its input
2959.Qq interactive
2960(again, no call made to
2961.Fn isatty ) .
2962This is the opposite of
2963.Dq always-interactive .
2964.It stack
2965Enables the use of start condition stacks
2966(see
2967.Sx START CONDITIONS
2968above).
2969.It stdinit
2970If set (i.e.,
2971.Dq %option stdinit ) ,
2972initializes
2973.Fa yyin
2974and
2975.Fa yyout
2976to stdin and stdout, instead of the default of
2977.Dq nil .
2978Some existing
2979.Nm lex
2980programs depend on this behavior, even though it is not compliant with ANSI C,
2981which does not require stdin and stdout to be compile-time constant.
2982.It yylineno
2983Directs
2984.Nm
2985to generate a scanner that maintains the number of the current line
2986read from its input in the global variable
2987.Fa yylineno .
2988This option is implied by
2989.Dq %option lex-compat .
2990.It yywrap
2991If unset (i.e.,
2992.Dq %option noyywrap ) ,
2993makes the scanner not call
2994.Fn yywrap
2995upon an end-of-file, but simply assume that there are no more files to scan
2996(until the user points
2997.Fa yyin
2998at a new file and calls
2999.Fn yylex
3000again).
3001.El
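.Pp
A minimal sketch combining two of these options: a complete scanner that
reports the line on which each
.Qq TODO
appears.
The pattern and message are arbitrary choices for the example.
Because
.Dq main
implies
.Dq noyywrap ,
the result should not need
.Fl lfl
at link time:
.Bd -literal -offset indent
%option main yylineno

%%

TODO    printf("TODO on line %d\en", yylineno);
\&.|\en    ;    /* discard everything else */
.Ed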
3002.Pp
3003.Nm
3004scans rule actions to determine whether the
3005.Em REJECT
3006or
3007.Fn yymore
3008features are being used.
3009The
3010.Dq reject
3011and
3012.Dq yymore
3013options are available to override its decision as to whether to use the
3014options, either by setting them (e.g.,
3015.Dq %option reject )
3016to indicate the feature is indeed used,
3017or unsetting them to indicate it actually is not used
3018(e.g.,
3019.Dq %option noyymore ) .
3020.Pp
3021Three options take string-delimited values, offset with
3022.Sq = :
3023.Pp
3024.D1 %option outfile="ABC"
3025.Pp
3026is equivalent to
3027.Fl o Ns Ar ABC ,
3028and
3029.Pp
3030.D1 %option prefix="XYZ"
3031.Pp
3032is equivalent to
3033.Fl P Ns Ar XYZ .
3034Finally,
3035.Pp
3036.D1 %option yyclass="foo"
3037.Pp
3038only applies when generating a C++ scanner
3039.Pf ( Fl +
3040option).
3041It informs
3042.Nm
3043that
3044.Dq foo
3045has been derived as a subclass of yyFlexLexer, so
3046.Nm
3047will place actions in the member function
3048.Dq foo::yylex()
3049instead of
3050.Dq yyFlexLexer::yylex() .
3051It also generates a
3052.Dq yyFlexLexer::yylex()
3053member function that emits a run-time error (by invoking
3054.Dq yyFlexLexer::LexerError() )
3055if called.
3056See
3057.Sx GENERATING C++ SCANNERS ,
3058below, for additional information.
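.Pp
The following sketch shows one way
.Dq yyclass
might be used.
The class name
.Dq WordCounter
and its
.Fa words
member are invented for the example:
.Bd -literal -offset indent
%option c++ noyywrap yyclass="WordCounter"

%{
#include <iostream>

/* Hypothetical subclass; flex generates WordCounter::yylex(). */
class WordCounter : public yyFlexLexer {
public:
        WordCounter() : words(0) {}
        virtual int yylex();
        int words;
};
%}

%%

[a-zA-Z]+   ++words;
\&.|\en        ;

%%

int
main()
{
        WordCounter counter;

        counter.yylex();
        std::cout << counter.words << std::endl;
        return 0;
}
.Ed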
3059.Pp
3060A number of options are available for
3061lint
3062purists who want to suppress the appearance of unneeded routines
3063in the generated scanner.
3064Each of the following, if unset
3065(e.g.,
3066.Dq %option nounput ) ,
3067results in the corresponding routine not appearing in the generated scanner:
3068.Bd -unfilled -offset indent
3069input, unput
3070yy_push_state, yy_pop_state, yy_top_state
3071yy_scan_buffer, yy_scan_bytes, yy_scan_string
3072.Ed
3073.Pp
3074(though
3075.Fn yy_push_state
3076and friends won't appear anyway unless
3077.Dq %option stack
3078is being used).
3079.Sh PERFORMANCE CONSIDERATIONS
3080The main design goal of
3081.Nm
3082is that it generate high-performance scanners.
3083It has been optimized for dealing well with large sets of rules.
3084Aside from the effects on scanner speed of the table compression
3085.Fl C
3086options outlined above,
3087there are a number of options/actions which degrade performance.
3088These are, from most expensive to least:
3089.Bd -unfilled -offset indent
3090REJECT
3091%option yylineno
3092arbitrary trailing context
3093
3094pattern sets that require backing up
3095%array
3096%option interactive
3097%option always-interactive
3098
3099\&'^' beginning-of-line operator
3100yymore()
3101.Ed
3102.Pp
3103with the first three all being quite expensive
3104and the last two being quite cheap.
3105Note also that
3106.Fn unput
3107is implemented as a routine call that potentially does quite a bit of work,
3108while
3109.Fn yyless
3110is a quite-cheap macro; so if just putting back some excess text,
3111use
3112.Fn yyless .
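.Pp
For instance, a rule that has matched slightly too much can hand the
excess back with
.Fn yyless .
In this sketch the token name
.Dv TOK_NUMBER
is hypothetical, and the two trailing dots are returned to the input
to be rescanned:
.Bd -literal -offset indent
%%
[0-9]+".."   {
        /* keep the digits, push back the ".." */
        yyless(yyleng - 2);
        return TOK_NUMBER;
}
.Ed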
3113.Pp
3114.Em REJECT
3115should be avoided at all costs when performance is important.
3116It is a particularly expensive option.
3117.Pp
3118Getting rid of backing up is messy and often may be an enormous
3119amount of work for a complicated scanner.
In principle, one begins by using the
3121.Fl b
3122flag to generate a
3123.Pa lex.backup
3124file.
3125For example, on the input
3126.Bd -literal -offset indent
3127%%
3128foo        return TOK_KEYWORD;
3129foobar     return TOK_KEYWORD;
3130.Ed
3131.Pp
3132the file looks like:
3133.Bd -literal -offset indent
3134State #6 is non-accepting -
3135 associated rule line numbers:
3136       2       3
3137 out-transitions: [ o ]
3138 jam-transitions: EOF [ \e001-n  p-\e177 ]
3139
3140State #8 is non-accepting -
3141 associated rule line numbers:
3142       3
3143 out-transitions: [ a ]
3144 jam-transitions: EOF [ \e001-`  b-\e177 ]
3145
3146State #9 is non-accepting -
3147 associated rule line numbers:
3148       3
3149 out-transitions: [ r ]
3150 jam-transitions: EOF [ \e001-q  s-\e177 ]
3151
3152Compressed tables always back up.
3153.Ed
3154.Pp
3155The first few lines tell us that there's a scanner state in
3156which it can make a transition on an
3157.Sq o
3158but not on any other character,
3159and that in that state the currently scanned text does not match any rule.
3160The state occurs when trying to match the rules found
3161at lines 2 and 3 in the input file.
3162If the scanner is in that state and then reads something other than an
3163.Sq o ,
3164it will have to back up to find a rule which is matched.
3165With a bit of headscratching one can see that this must be the
3166state it's in when it has seen
3167.Sq fo .
3168When this has happened, if anything other than another
3169.Sq o
3170is seen, the scanner will have to back up to simply match the
3171.Sq f
3172.Pq by the default rule .
3173.Pp
3174The comment regarding State #8 indicates there's a problem when
3175.Qq foob
3176has been scanned.
3177Indeed, on any character other than an
3178.Sq a ,
3179the scanner will have to back up to accept
3180.Qq foo .
3181Similarly, the comment for State #9 concerns when
3182.Qq fooba
3183has been scanned and an
3184.Sq r
3185does not follow.
3186.Pp
3187The final comment reminds us that there's no point going to
3188all the trouble of removing backing up from the rules unless we're using
3189.Fl Cf
3190or
3191.Fl CF ,
3192since there's no performance gain doing so with compressed scanners.
3193.Pp
3194The way to remove the backing up is to add
3195.Qq error
3196rules:
3197.Bd -literal -offset indent
3198%%
3199foo    return TOK_KEYWORD;
3200foobar return TOK_KEYWORD;
3201
3202fooba  |
3203foob   |
3204fo {
3205        /* false alarm, not really a keyword */
3206        return TOK_ID;
3207}
3208.Ed
3209.Pp
3210Eliminating backing up among a list of keywords can also be done using a
3211.Qq catch-all
3212rule:
3213.Bd -literal -offset indent
3214%%
3215foo    return TOK_KEYWORD;
3216foobar return TOK_KEYWORD;
3217
3218[a-z]+ return TOK_ID;
3219.Ed
3220.Pp
3221This is usually the best solution when appropriate.
3222.Pp
3223Backing up messages tend to cascade.
3224With a complicated set of rules it's not uncommon to get hundreds of messages.
3225If one can decipher them, though,
3226it often only takes a dozen or so rules to eliminate the backing up
3227(though it's easy to make a mistake and have an error rule accidentally match
3228a valid token; a possible future
3229.Nm
3230feature will be to automatically add rules to eliminate backing up).
3231.Pp
3232It's important to keep in mind that the benefits of eliminating
3233backing up are gained only if
3234.Em every
3235instance of backing up is eliminated.
3236Leaving just one gains nothing.
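.Pp
A workable routine, sketched here with the placeholder file name
.Pa scan.l ,
is to regenerate
.Pa lex.backup
after each change to the rules and check that it no longer reports any
non-accepting states:
.Bd -literal -offset indent
$ flex -b -Cf scan.l
$ grep -c "non-accepting" lex.backup
.Ed
.Pp
Assuming the report keeps the format shown earlier,
the count reaches zero once every backing-up state has been removed.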
3237.Pp
3238.Em Variable
3239trailing context
3240(where both the leading and trailing parts do not have a fixed length)
3241entails almost the same performance loss as
3242.Em REJECT
3243.Pq i.e., substantial .
3244So when possible a rule like:
3245.Bd -literal -offset indent
3246%%
3247mouse|rat/(cat|dog)   run();
3248.Ed
3249.Pp
3250is better written:
3251.Bd -literal -offset indent
3252%%
3253mouse/cat|dog         run();
3254rat/cat|dog           run();
3255.Ed
3256.Pp
3257or as
3258.Bd -literal -offset indent
3259%%
3260mouse|rat/cat         run();
3261mouse|rat/dog         run();
3262.Ed
3263.Pp
3264Note that here the special
3265.Sq |\&
3266action does not provide any savings, and can even make things worse (see
3267.Sx BUGS
3268below).
3269.Pp
3270Another area where the user can increase a scanner's performance
3271.Pq and one that's easier to implement
3272arises from the fact that the longer the tokens matched,
3273the faster the scanner will run.
3274This is because with long tokens the processing of most input
3275characters takes place in the
3276.Pq short
3277inner scanning loop, and does not often have to go through the additional work
3278of setting up the scanning environment (e.g.,
3279.Fa yytext )
3280for the action.
3281Recall the scanner for C comments:
3282.Bd -literal -offset indent
3283%x comment
3284%%
3285int line_num = 1;
3286
3287"/*"                    BEGIN(comment);
3288
3289<comment>[^*\en]*
3290<comment>"*"+[^*/\en]*
3291<comment>\en             ++line_num;
3292<comment>"*"+"/"        BEGIN(INITIAL);
3293.Ed
3294.Pp
3295This could be sped up by writing it as:
3296.Bd -literal -offset indent
3297%x comment
3298%%
3299int line_num = 1;
3300
3301"/*"                    BEGIN(comment);
3302
3303<comment>[^*\en]*
3304<comment>[^*\en]*\en      ++line_num;
3305<comment>"*"+[^*/\en]*
3306<comment>"*"+[^*/\en]*\en ++line_num;
3307<comment>"*"+"/"        BEGIN(INITIAL);
3308.Ed
3309.Pp
3310Now instead of each newline requiring the processing of another action,
3311recognizing the newlines is
3312.Qq distributed
3313over the other rules to keep the matched text as long as possible.
3314Note that adding rules does
3315.Em not
3316slow down the scanner!
3317The speed of the scanner is independent of the number of rules or
3318(modulo the considerations given at the beginning of this section)
3319how complicated the rules are with regard to operators such as
3320.Sq *
3321and
3322.Sq |\& .
3323.Pp
3324A final example in speeding up a scanner:
3325scan through a file containing identifiers and keywords, one per line
3326and with no other extraneous characters, and recognize all the keywords.
3327A natural first approach is:
3328.Bd -literal -offset indent
3329%%
3330asm      |
3331auto     |
3332break    |
3333\&... etc ...
3334volatile |
3335while    /* it's a keyword */
3336
3337\&.|\en     /* it's not a keyword */
3338.Ed
3339.Pp
To eliminate the backing up, introduce a catch-all rule:
3341.Bd -literal -offset indent
3342%%
3343asm      |
3344auto     |
3345break    |
3346\&... etc ...
3347volatile |
3348while    /* it's a keyword */
3349
3350[a-z]+   |
3351\&.|\en     /* it's not a keyword */
3352.Ed
3353.Pp
3354Now, if it's guaranteed that there's exactly one word per line,
3355then we can reduce the total number of matches by a half by
3356merging in the recognition of newlines with that of the other tokens:
3357.Bd -literal -offset indent
3358%%
3359asm\en      |
3360auto\en     |
3361break\en    |
3362\&... etc ...
3363volatile\en |
3364while\en    /* it's a keyword */
3365
3366[a-z]+\en   |
3367\&.|\en       /* it's not a keyword */
3368.Ed
3369.Pp
3370One has to be careful here,
3371as we have now reintroduced backing up into the scanner.
3372In particular, while we know that there will never be any characters
3373in the input stream other than letters or newlines,
3374.Nm
3375can't figure this out, and it will plan for possibly needing to back up
3376when it has scanned a token like
3377.Qq auto
3378and then the next character is something other than a newline or a letter.
3379Previously it would then just match the
3380.Qq auto
3381rule and be done, but now it has no
3382.Qq auto
3383rule, only an
3384.Qq auto\en
3385rule.
3386To eliminate the possibility of backing up,
3387we could either duplicate all rules but without final newlines or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule,
this one not including a newline:
3391.Bd -literal -offset indent
3392%%
3393asm\en      |
3394auto\en     |
3395break\en    |
3396\&... etc ...
3397volatile\en |
3398while\en    /* it's a keyword */
3399
3400[a-z]+\en   |
3401[a-z]+     |
3402\&.|\en       /* it's not a keyword */
3403.Ed
3404.Pp
3405Compiled with
3406.Fl Cf ,
3407this is about as fast as one can get a
3408.Nm
3409scanner to go for this particular problem.
3410.Pp
3411A final note:
3412.Nm
3413is slow when matching NUL's,
3414particularly when a token contains multiple NUL's.
3415It's best to write rules which match short
3416amounts of text if it's anticipated that the text will often include NUL's.
3417.Pp
3418Another final note regarding performance: as mentioned above in the section
3419.Sx HOW THE INPUT IS MATCHED ,
3420dynamically resizing
3421.Fa yytext
3422to accommodate huge tokens is a slow process because it presently requires that
3423the
3424.Pq huge
3425token be rescanned from the beginning.
3426Thus if performance is vital, it is better to attempt to match
3427.Qq large
3428quantities of text but not
3429.Qq huge
3430quantities, where the cutoff between the two is at about 8K characters/token.
.Sh GENERATING C++ SCANNERS
.Nm
provides two different ways to generate scanners for use with C++.
The first way is to simply compile a scanner generated by
.Nm
using a C++ compiler instead of a C compiler.
This should not generate any compilation errors
(please report any found to the email address given in the
.Sx AUTHORS
section below).
C++ code can then be used in rule actions instead of C code.
Note that the default input source for scanners remains
.Fa yyin ,
and default echoing is still done to
.Fa yyout .
Both of these remain
.Fa FILE *
variables and not C++ streams.
.Pp
.Nm
can also be used to generate a C++ scanner class, using the
.Fl +
option (or, equivalently,
.Dq %option c++ ) ,
which is automatically specified if the name of the flex executable ends in a
.Sq + ,
such as
.Nm flex++ .
When using this option,
.Nm
defaults to generating the scanner to the file
.Pa lex.yy.cc
instead of
.Pa lex.yy.c .
The generated scanner includes the header file
.In g++/FlexLexer.h ,
which defines the interface to two C++ classes.
.Pp
The first class,
.Em FlexLexer ,
provides an abstract base class defining the general scanner class interface.
It provides the following member functions:
.Bl -tag -width Ds
.It const char* YYText()
Returns the text of the most recently matched token, the equivalent of
.Fa yytext .
.It int YYLeng()
Returns the length of the most recently matched token, the equivalent of
.Fa yyleng .
.It int lineno() const
Returns the current input line number
(see
.Dq %option yylineno ) ,
or 1 if
.Dq %option yylineno
was not used.
.It void set_debug(int flag)
Sets the debugging flag for the scanner, equivalent to assigning to
.Fa yy_flex_debug
(see the
.Sx OPTIONS
section above).
Note that the scanner must be built using
.Dq %option debug
to include debugging information in it.
.It int debug() const
Returns the current setting of the debugging flag.
.El
.Pp
Also provided are member functions equivalent to
.Fn yy_switch_to_buffer ,
.Fn yy_create_buffer
(though the first argument is an
.Fa std::istream*
object pointer and not a
.Fa FILE* ) ,
.Fn yy_flush_buffer ,
.Fn yy_delete_buffer ,
and
.Fn yyrestart
(again, the first argument is an
.Fa std::istream*
object pointer).
.Pp
The second class defined in
.In g++/FlexLexer.h
is
.Fa yyFlexLexer ,
which is derived from
.Fa FlexLexer .
It defines the following additional member functions:
.Bl -tag -width Ds
.It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
Constructs a
.Fa yyFlexLexer
object using the given streams for input and output.
If not specified, the streams default to
.Fa cin
and
.Fa cout ,
respectively.
.It virtual int yylex()
Performs the same role as
.Fn yylex
does for ordinary flex scanners: it scans the input stream, consuming
tokens, until a rule's action returns a value.
If subclass
.Sq S
is derived from
.Fa yyFlexLexer ,
in order to access the member functions and variables of
.Sq S
inside
.Fn yylex ,
use
.Dq %option yyclass="S"
to inform
.Nm
that the
.Sq S
subclass will be used instead of
.Fa yyFlexLexer .
In this case, rather than generating
.Dq yyFlexLexer::yylex() ,
.Nm
generates
.Dq S::yylex()
(and also generates a dummy
.Dq yyFlexLexer::yylex()
that calls
.Dq yyFlexLexer::LexerError()
if called).
.It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
Reassigns
.Fa yyin
to
.Fa new_in
.Pq if non-nil
and
.Fa yyout
to
.Fa new_out
.Pq ditto ,
deleting the previous input buffer if
.Fa yyin
is reassigned.
.It int yylex(std::istream* new_in, std::ostream* new_out = 0)
First switches the input streams via
.Dq switch_streams(new_in, new_out)
and then returns the value of
.Fn yylex .
.El
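.Pp
As a minimal sketch of the
.Dq %option yyclass
mechanism described above, the following scanner counts words in a
member of a derived class.
The class name
.Fa CountingLexer
and its
.Fa words
member are hypothetical, chosen only for illustration, and the sketch
assumes that the generated file includes
.In g++/FlexLexer.h
before the code from the definitions section:
.Bd -literal -offset indent
%option c++ noyywrap yyclass="CountingLexer"

%{
#include <iostream>

// Hypothetical subclass; yyclass only requires that it declare yylex().
class CountingLexer : public yyFlexLexer {
public:
	CountingLexer() : words(0) {}
	virtual int yylex();	// body is generated by flex
	int words;
};
%}

%%

[A-Za-z]+	++words;	/* member of CountingLexer */
\&.|\en		/* ignore everything else */

%%

int main()
{
	CountingLexer lexer;	// reads from cin by default

	lexer.yylex();
	std::cout << lexer.words << " words\en";
	return 0;
}
.Ed
.Pp
Because the rule actions are compiled as part of
.Dq CountingLexer::yylex() ,
they can refer to
.Fa words
directly.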
.Pp
In addition,
.Fa yyFlexLexer
defines the following protected virtual functions which can be redefined
in derived classes to tailor the scanner:
.Bl -tag -width Ds
.It virtual int LexerInput(char* buf, int max_size)
Reads up to
.Fa max_size
characters into
.Fa buf
and returns the number of characters read.
To indicate end-of-input, return 0 characters.
Note that
.Qq interactive
scanners (see the
.Fl B
and
.Fl I
flags) define the macro
.Dv YY_INTERACTIVE .
If
.Fn LexerInput
has been redefined, and it's necessary to take different actions depending on
whether or not the scanner might be scanning an interactive input source,
it's possible to test for the presence of this name via
.Dq #ifdef .
.It virtual void LexerOutput(const char* buf, int size)
Writes out
.Fa size
characters from the buffer
.Fa buf ,
which, while NUL-terminated, may also contain
.Qq internal
NUL's if the scanner's rules can match text with NUL's in them.
.It virtual void LexerError(const char* msg)
Reports a fatal error message.
The default version of this function writes the message to the stream
.Fa cerr
and exits.
.El
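.Pp
The following sketch shows one way
.Fn LexerInput
might be redefined so that a scanner reads from a string held in memory
rather than from a stream.
The class
.Fa StringLexer
and its members are hypothetical names, and the example relies on
.Dq %option yyclass
as described above:
.Bd -literal -offset indent
%option c++ noyywrap yyclass="StringLexer"

%{
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical subclass that feeds the scanner from a std::string.
class StringLexer : public yyFlexLexer {
public:
	explicit StringLexer(const std::string &s) : src(s), pos(0) {}
	virtual int yylex();
protected:
	virtual int LexerInput(char *buf, int max_size);
private:
	std::string src;
	std::string::size_type pos;
};

int
StringLexer::LexerInput(char *buf, int max_size)
{
	std::string::size_type n = src.size() - pos;

	if (n > static_cast<std::string::size_type>(max_size))
		n = max_size;
	std::memcpy(buf, src.data() + pos, n);
	pos += n;
	return static_cast<int>(n);	// 0 signals end-of-input
}
%}

%%

[0-9]+	std::cout << "number " << YYText() << '\en';
\&.|\en	/* skip everything else */

%%

int main()
{
	StringLexer lexer("10 apples, 20 pears");

	lexer.yylex();
	return 0;
}
.Ed
.Pp
Returning 0 once the string is exhausted is what tells the scanner that
the input has ended.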
.Pp
Note that a
.Fa yyFlexLexer
object contains its entire scanning state.
Thus such objects can be used to create reentrant scanners.
Multiple instances of the same
.Fa yyFlexLexer
class can be instantiated, and multiple C++ scanner classes can be combined
in the same program using the
.Fl P
option discussed above.
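.Pp
To illustrate that each object carries its own scanning state, the
following sketch interleaves two scanners over two input files.
It assumes a C++ scanner generated without
.Dq %option yyclass
or
.Fl P ,
compiled separately and linked with this file; the two file arguments
are placeholders:
.Bd -literal -offset indent
#include <fstream>
#include <g++/FlexLexer.h>

int main(int argc, char **argv)
{
	if (argc != 3)
		return 1;

	std::ifstream a(argv[1]), b(argv[2]);
	yyFlexLexer first(&a), second(&b);

	// Alternate between the scanners; neither disturbs the other.
	int ra = 1, rb = 1;
	while (ra != 0 || rb != 0) {
		if (ra != 0)
			ra = first.yylex();
		if (rb != 0)
			rb = second.yylex();
	}
	return 0;
}
.Ed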
.Pp
Finally, note that the
.Dq %array
feature is not available to C++ scanner classes;
.Dq %pointer
must be used
.Pq the default .
.Pp
Here is an example of a simple C++ scanner:
.Bd -literal -offset indent
// An example of using the flex C++ scanner class.

%{
#include <iostream>
using std::cout;	// the actions below write to cout
int mylineno = 0;
%}

string  \e"[^\en"]+\e"

ws      [ \et]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*" {
        int c;

        while ((c = yyinput()) != 0) {
                if(c == '\en')
                    ++mylineno;
                else if(c == '*') {
                    if ((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                }
        }
}

{number}  cout << "number " << YYText() << '\en';

\en        mylineno++;

{name}    cout << "name " << YYText() << '\en';

{string}  cout << "string " << YYText() << '\en';

%%

int main(int /* argc */, char** /* argv */)
{
	FlexLexer* lexer = new yyFlexLexer;
	while(lexer->yylex() != 0)
	    ;
	delete lexer;
	return 0;
}
.Ed
.Pp
To create multiple
.Pq different
lexer classes, use the
.Fl P
flag
(or the
.Dq prefix=
option)
to rename each
.Fa yyFlexLexer
to some other
.Fa xxFlexLexer .
.In g++/FlexLexer.h
can then be included in other sources once per lexer class, first renaming
.Fa yyFlexLexer
as follows:
.Bd -literal -offset indent
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <g++/FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <g++/FlexLexer.h>
.Ed
.Pp
The renaming shown above applies if, for example,
.Dq %option prefix="xx"
is used for one scanner and
.Dq %option prefix="zz"
is used for the other.
.Pp
.Sy IMPORTANT :
the present form of the scanning class is experimental
and may change considerably between major releases.
.Sh INCOMPATIBILITIES WITH LEX AND POSIX
.Nm
is a rewrite of the
.At
.Nm lex
tool
(the two implementations do not share any code, though),
with some extensions and incompatibilities, both of which are of concern
to those who wish to write scanners acceptable to either implementation.
.Nm
is fully compliant with the
.Tn POSIX
.Nm lex
specification, except that when using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
which is counter to the
.Tn POSIX
specification.
.Pp
In this section we discuss all of the known areas of incompatibility between
.Nm ,
.At
.Nm lex ,
and the
.Tn POSIX
specification.
.Pp
.Nm flex Ns 's
.Fl l
option turns on maximum compatibility with the original
.At
.Nm lex
implementation, at the cost of a major loss in the generated scanner's
performance.
We note below which incompatibilities can be overcome using the
.Fl l
option.
.Pp
.Nm
is fully compatible with
.Nm lex
with the following exceptions:
.Bl -dash
.It
The undocumented
.Nm lex
scanner internal variable
.Fa yylineno
is not supported unless
.Fl l
or
.Dq %option yylineno
is used.
.Pp
.Fa yylineno
should be maintained on a per-buffer basis, rather than a per-scanner
.Pq single global variable
basis.
.Pp
.Fa yylineno
is not part of the
.Tn POSIX
specification.
.It
The
.Fn input
routine is not redefinable, though it may be called to read characters
following whatever has been matched by a rule.
If
.Fn input
encounters an end-of-file, the normal
.Fn yywrap
processing is done.
A
.Dq real
end-of-file is returned by
.Fn input
as
.Dv EOF .
.Pp
Input is instead controlled by defining the
.Dv YY_INPUT
macro.
.Pp
The
.Nm
restriction that
.Fn input
cannot be redefined is in accordance with the
.Tn POSIX
specification, which simply does not specify any way of controlling the
scanner's input other than by making an initial assignment to
.Fa yyin .
.It
The
.Fn unput
routine is not redefinable.
This restriction is in accordance with
.Tn POSIX .
.It
.Nm
scanners are not as reentrant as
.Nm lex
scanners.
In particular, if a scanner is interactive and
an interrupt handler long-jumps out of the scanner,
and the scanner is subsequently called again,
the following error message may be displayed:
.Pp
.D1 fatal flex scanner internal error--end of buffer missed
.Pp
To reenter the scanner, first use
.Pp
.Dl yyrestart(yyin);
.Pp
Note that this call will throw away any buffered input;
usually this isn't a problem with an interactive scanner.
.Pp
Also note that flex C++ scanner classes are reentrant,
so if using C++ is an option, they should be used instead.
See
.Sx GENERATING C++ SCANNERS
above for details.
.It
.Fn output
is not supported.
Output from the
.Em ECHO
macro is done to the file-pointer
.Fa yyout
.Pq default stdout .
.Pp
.Fn output
is not part of the
.Tn POSIX
specification.
.It
.Nm lex
does not support exclusive start conditions
.Pq %x ,
though they are in the
.Tn POSIX
specification.
.It
When definitions are expanded,
.Nm
encloses them in parentheses.
With
.Nm lex ,
the following:
.Bd -literal -offset indent
NAME    [A-Z][A-Z0-9]*
%%
foo{NAME}?      printf("Found it\en");
%%
.Ed
.Pp
will not match the string
.Qq foo
because when the macro is expanded the rule is equivalent to
.Qq foo[A-Z][A-Z0-9]*?
and the precedence is such that the
.Sq ?\&
is associated with
.Qq [A-Z0-9]* .
With
.Nm ,
the rule will be expanded to
.Qq foo([A-Z][A-Z0-9]*)?
and so the string
.Qq foo
will match.
.Pp
Note that if the definition begins with
.Sq ^
or ends with
.Sq $
then it is not expanded with parentheses, to allow these operators to appear in
definitions without losing their special meanings.
But the
.Sq Aq s ,
.Sq / ,
and
.Aq Aq EOF
operators cannot be used in a
.Nm
definition.
.Pp
Using
.Fl l
results in the
.Nm lex
behavior of no parentheses around the definition.
.Pp
The
.Tn POSIX
specification is that the definition be enclosed in parentheses.
.It
Some implementations of
.Nm lex
allow a rule's action to begin on a separate line,
if the rule's pattern has trailing whitespace:
.Bd -literal -offset indent
%%
foo|bar<space here>
  { foobar_action(); }
.Ed
.Pp
.Nm
does not support this feature.
.It
The
.Nm lex
.Sq %r
.Pq generate a Ratfor scanner
option is not supported.
It is not part of the
.Tn POSIX
specification.
.It
After a call to
.Fn unput ,
.Fa yytext
is undefined until the next token is matched,
unless the scanner was built using
.Dq %array .
This is not the case with
.Nm lex
or the
.Tn POSIX
specification.
The
.Fl l
option does away with this incompatibility.
.It
The precedence of the
.Sq {}
.Pq numeric range
operator is different.
.Nm lex
interprets
.Qq abc{1,3}
as match one, two, or three occurrences of
.Sq abc ,
whereas
.Nm
interprets it as match
.Sq ab
followed by one, two, or three occurrences of
.Sq c .
The latter is in agreement with the
.Tn POSIX
specification.
.It
The precedence of the
.Sq ^
operator is different.
.Nm lex
interprets
.Qq ^foo|bar
as match either
.Sq foo
at the beginning of a line, or
.Sq bar
anywhere, whereas
.Nm
interprets it as match either
.Sq foo
or
.Sq bar
if they come at the beginning of a line.
The latter is in agreement with the
.Tn POSIX
specification.
.It
The special table-size declarations such as
.Sq %a
supported by
.Nm lex
are not required by
.Nm
scanners;
.Nm
ignores them.
.It
The name
.Dv FLEX_SCANNER
is #define'd so scanners may be written for use with either
.Nm
or
.Nm lex .
Scanners also include
.Dv YY_FLEX_MAJOR_VERSION
and
.Dv YY_FLEX_MINOR_VERSION
indicating which version of
.Nm
generated the scanner
(for example, for the 2.5 release, these defines would be 2 and 5,
respectively).
.El
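.Pp
As a sketch of how
.Dv FLEX_SCANNER
can be used, the following definitions section supplies
.Dv YY_INPUT
only when the scanner is built by
.Nm ,
leaving the
.Dq #else
branch for whatever another implementation requires.
The one-character-at-a-time
.Fn getchar
reader is merely illustrative:
.Bd -literal -offset indent
%{
#include <stdio.h>

#ifdef FLEX_SCANNER
/* flex: input() is not redefinable, so control input via YY_INPUT. */
#define YY_INPUT(buf, result, max_size) \e
	{ \e
		int c = getchar(); \e
		result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
	}
#else
/* Other lex implementations: redefine input() here if required. */
#endif
%}
.Ed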
.Pp
The following
.Nm
features are not included in
.Nm lex
or the
.Tn POSIX
specification:
.Bd -unfilled -offset indent
C++ scanners
%option
start condition scopes
start condition stacks
interactive/non-interactive scanners
yy_scan_string() and friends
yyterminate()
yy_set_interactive()
yy_set_bol()
YY_AT_BOL()
<<EOF>>
<*>
YY_DECL
YY_START
YY_USER_ACTION
YY_USER_INIT
#line directives
%{}'s around actions
multiple actions on a line
.Ed
.Pp
plus almost all of the
.Nm
flags.
The last feature in the list refers to the fact that with
.Nm
multiple actions can be placed on the same line,
separated with semi-colons, while with
.Nm lex ,
the following
.Pp
.Dl foo    handle_foo(); ++num_foos_seen;
.Pp
is
.Pq rather surprisingly
truncated to
.Pp
.Dl foo    handle_foo();
.Pp
.Nm
does not truncate the action.
Actions that are not enclosed in braces
are simply terminated at the end of the line.
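.Pp
One form that should not depend on either behavior is to enclose the
statements in braces, so that they are treated as a single action:
.Pp
.Dl foo    { handle_foo(); ++num_foos_seen; }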
.Sh FILES
.Bl -tag -width "<g++/FlexLexer.h>"
.It Pa flex.skl
Skeleton scanner.
This file is only used when building flex, not when
.Nm
executes.
.It Pa lex.backup
Backing-up information for the
.Fl b
flag (called
.Pa lex.bck
on some systems).
.It Pa lex.yy.c
Generated scanner
(called
.Pa lexyy.c
on some systems).
.It Pa lex.yy.cc
Generated C++ scanner class, when using
.Fl + .
.It In g++/FlexLexer.h
Header file defining the C++ scanner base class,
.Fa FlexLexer ,
and its derived class,
.Fa yyFlexLexer .
.It Pa /usr/lib/libl.*
.Nm
libraries.
The
.Pa /usr/lib/libfl.*\&
libraries are links to these.
Scanners must be linked using either
.Fl \&ll
or
.Fl lfl .
.El
.Sh EXIT STATUS
.Ex -std flex
.Sh DIAGNOSTICS
.Bl -diag
.It warning, rule cannot be matched
Indicates that the given rule cannot be matched because it follows other rules
that will always match the same text as it.
For example, in the following
.Dq foo
cannot be matched because it comes after an identifier
.Qq catch-all
rule:
.Bd -literal -offset indent
[a-z]+    got_identifier();
foo       got_foo();
.Ed
.Pp
Using
.Em REJECT
in a scanner suppresses this warning.
.It "warning, \-s option given but default rule can be matched"
Means that it is possible
.Pq perhaps only in a particular start condition
that the default rule
.Pq match any single character
is the only one that will match a particular input.
Since
.Fl s
was given, presumably this is not intended.
.It reject_used_but_not_detected undefined
.It yymore_used_but_not_detected undefined
These errors can occur at compile time.
They indicate that the scanner uses
.Em REJECT
or
.Fn yymore
but that
.Nm
failed to notice the fact, meaning that
.Nm
scanned the first two sections looking for occurrences of these actions
and failed to find any, but somehow they snuck in
.Pq via an #include file, for example .
Use
.Dq %option reject
or
.Dq %option yymore
to indicate to
.Nm
that these features are really needed.
.It flex scanner jammed
A scanner compiled with
.Fl s
has encountered an input string which wasn't matched by any of its rules.
This error can also occur due to internal problems.
.It token too large, exceeds YYLMAX
The scanner uses
.Dq %array
and one of its rules matched a string longer than the
.Dv YYLMAX
constant
.Pq 8K bytes by default .
The value can be increased by #define'ing
.Dv YYLMAX
in the definitions section of
.Nm
input.
.It "scanner requires \-8 flag to use the character 'x'"
The scanner specification includes recognizing the 8-bit character
.Sq x
and the
.Fl 8
flag was not specified, and defaulted to 7-bit because the
.Fl Cf
or
.Fl CF
table compression options were used.
See the discussion of the
.Fl 7
flag for details.
.It flex scanner push-back overflow
.Fn unput
was used to push back so much text that the scanner's buffer
could not hold both the pushed-back text and the current token in
.Fa yytext .
Ideally the scanner should dynamically resize the buffer in this case,
but at present it does not.
.It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
The scanner was working on matching an extremely large token and needed
to expand the input buffer.
This doesn't work with scanners that use
.Em REJECT .
.It "fatal flex scanner internal error--end of buffer missed"
This can occur in a scanner which is reentered after a long-jump
has jumped out
.Pq or over
the scanner's activation frame.
Before reentering the scanner, use:
.Pp
.Dl yyrestart(yyin);
.Pp
or, as noted above, switch to using the C++ scanner class.
.It "too many start conditions in <> construct!"
More start conditions than exist were listed in a <> construct
(so at least one of them must have been listed twice).
.El
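.Pp
For the
.Qq rule cannot be matched
warning, one possible fix is to list the more specific rule first.
When two rules match text of the same length, the rule listed earlier
in the input is chosen, so in this ordering
.Dq foo
is matched by its own rule and longer identifiers still reach the
catch-all rule:
.Bd -literal -offset indent
foo       got_foo();
[a-z]+    got_identifier();
.Ed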
.Sh SEE ALSO
.Xr awk 1 ,
.Xr sed 1 ,
.Xr yacc 1
.Rs
.\" 4.4BSD PSD:16
.%A M. E. Lesk
.%T Lex \(em Lexical Analyzer Generator
.%I AT&T Bell Laboratories
.%R Computing Science Technical Report
.%N 39
.%D October 1975
.Re
.Rs
.%A John Levine
.%A Tony Mason
.%A Doug Brown
.%B Lex & Yacc
.%I O'Reilly and Associates
.%N 2nd edition
.Re
.Rs
.%A Alfred Aho
.%A Ravi Sethi
.%A Jeffrey Ullman
.%B Compilers: Principles, Techniques and Tools
.%I Addison-Wesley
.%D 1986
.%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
.Re
.Sh STANDARDS
The
.Nm lex
utility is compliant with the
.St -p1003.1-2008
specification,
though its presence is optional.
.Pp
The flags
.Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
.Op Fl -help ,
and
.Op Fl -version
are extensions to that specification.
.Pp
See also the
.Sx INCOMPATIBILITIES WITH LEX AND POSIX
section, above.
.Sh AUTHORS
Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson.
Original version by Jef Poskanzer.
The fast table representation is a partial implementation of a design done by
Van Jacobson.
The implementation was done by Kevin Gong and Vern Paxson.
.Pp
Thanks to the many
.Nm
beta-testers, feedbackers, and contributors, especially Francois Pinard,
Casey Leedom,
Robert Abramovitz,
Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
Neal Becker, Nelson H.F. Beebe,
.Mt benson@odi.com ,
Karl Berry, Peter A. Bigot, Simon Blanchard,
Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
Brian Clapper, J.T. Conklin,
Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
Daniels, Chris G. Demetriou, Theo de Raadt,
Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
Jan Hajic, Charles Hemphill, NORO Hideo,
Jarkko Hietaniemi, Scott Hofmann,
Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
Amir Katz,
.Mt ken@ken.hilco.com ,
Kevin B. Kenny,
Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
David Loffredo, Mike Long,
Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
Bengt Martensson, Chris Metcalf,
Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
Richard Ohnemus, Karsten Pahnke,
Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
Frederic Raimbault, Pat Rankin, Rick Richardson,
Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
Andreas Scherer, Darrell Schiebel, Raf Schietekat,
Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
Chris Thewalt, Richard M. Timoney, Jodi Tsai,
Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
and those whose names have slipped my marginal mail-archiving skills
but whose contributions are appreciated all the
same.
.Pp
Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
distribution headaches.
.Pp
Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
to Benson Margulies and Fred Burke for C++ support;
to Kent Williams and Tom Epperly for C++ class support;
to Ove Ewerlid for support of NUL's;
and to Eric Hughes for support of multiple buffers.
.Pp
This work was primarily done when I was with the Real Time Systems Group
at the Lawrence Berkeley Laboratory in Berkeley, CA.
Many thanks to all there for the support I received.
.Pp
Send comments to
.Aq Mt vern@ee.lbl.gov .
.Sh BUGS
Some trailing context patterns cannot be properly matched and generate
warning messages
.Pq "dangerous trailing context" .
These are patterns where the ending of the first part of the rule
matches the beginning of the second part, such as
.Qq zx*/xy* ,
where the
.Sq x*
matches the
.Sq x
at the beginning of the trailing context.
(Note that the POSIX draft states that the text matched by such patterns
is undefined.)
.Pp
For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above mentioned performance loss.
In particular, parts using
.Sq |\&
or
.Sq {n}
(such as
.Qq foo{3} )
are always considered variable-length.
.Pp
Combining trailing context with the special
.Sq |\&
action can result in fixed trailing context being turned into
the more expensive variable trailing context.
This happens, for example, in the following:
.Bd -literal -offset indent
%%
abc      |
xyz/def
.Ed
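.Pp
A possible workaround, sketched here with a hypothetical
.Fn handle_token
function, is to give each rule its own action rather than sharing one
through
.Sq |\& ,
which should leave the trailing context of the second rule fixed:
.Bd -literal -offset indent
%%
abc      handle_token();
xyz/def  handle_token();
.Ed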
.Pp
Use of
.Fn unput
invalidates
.Fa yytext
and
.Fa yyleng ,
unless the
.Dq %array
directive
or the
.Fl l
option has been used.
.Pp
Pattern-matching of NUL's is substantially slower than matching other
characters.
.Pp
Dynamic resizing of the input buffer is slow, as it entails rescanning
all the text matched so far by the current
.Pq generally huge
token.
.Pp
Due to both buffering of input and read-ahead,
it is not possible to intermix calls to
.In stdio.h
routines, such as, for example,
.Fn getchar ,
with
.Nm
rules and expect it to work.
Call
.Fn input
instead.
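.Pp
For instance, an action that needs to consume the remainder of a line
might be sketched as follows.
The
.Sq #
comment syntax is only an example, and the test for 0 is a defensive
assumption for scanners in which
.Fn input
reports end-of-file that way:
.Bd -literal -offset indent
%%
"#"	{
		int c;

		/* Read through the scanner's own buffer, not stdio. */
		while ((c = input()) != '\en' && c != EOF && c != 0)
			;
	}
.Ed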
.Pp
The total number of table entries listed by the
.Fl v
flag excludes the number of table entries needed to determine
what rule has been matched.
The number of entries is equal to the number of DFA states
if the scanner does not use
.Em REJECT ,
and somewhat greater than the number of states if it does.
.Pp
.Em REJECT
cannot be used with the
.Fl f
or
.Fl F
options.
.Pp
The
.Nm
internal algorithms need documentation.