xref: /csrg-svn/lib/libcompat/regexp/regexp.3 (revision 48375)
1*48375Scael.\" Copyright 1991 The Regents of the University of California.
2*48375Scael.\" All rights reserved.
3*48375Scael.\"
4*48375Scael.\" %sccs.include.redist.man%
5*48375Scael.\"
6*48375Scael.\"     @(#)regexp.3	5.2 (Berkeley) 04/20/91
7*48375Scael.\"
8*48375Scael.Dd
9*48375Scael.Dt REGEXP 3
10*48375Scael.Os
11*48375Scael.Sh NAME
12*48375Scael.Nm regcomp ,
13*48375Scael.Nm regexec ,
14*48375Scael.Nm regsub ,
15*48375Scael.Nm regerror
16*48375Scael.Nd regular expression handlers
17*48375Scael.Sh SYNOPSIS
18*48375Scael.Fd #include <regexp.h>
19*48375Scael.Ft regexp *
20*48375Scael.Fn regcomp "const char *exp"
21*48375Scael.Ft int
22*48375Scael.Fn regexec "const regexp *prog" "const char *string"
23*48375Scael.Ft void
24*48375Scael.Fn regsub "const regexp *prog" "const char *source" "char *dest"
25*48375Scael.Sh DESCRIPTION
26*48375ScaelThe
27*48375Scael.Fn regcomp ,
28*48375Scael.Fn regexec ,
29*48375Scael.Fn regsub ,
30*48375Scaeland
31*48375Scael.Fn regerror
32*48375Scaelfunctions
33*48375Scaelimplement
34*48375Scael.Xr egrep 1 Ns -style
3541347Sbosticregular expressions and supporting facilities.
36*48375Scael.Pp
37*48375ScaelThe
38*48375Scael.Fn regcomp
39*48375Scaelfunction
4041347Sbosticcompiles a regular expression into a structure of type
41*48375Scael.Xr regexp ,
4241347Sbosticand returns a pointer to it.
4341347SbosticThe space has been allocated using
44*48375Scael.Xr malloc 3
4541347Sbosticand may be released by
46*48375Scael.Xr free .
47*48375Scael.Pp
48*48375ScaelThe
49*48375Scael.Fn regexec
50*48375Scaelfunction
51*48375Scaelmatches a
52*48375Scael.Dv NUL Ns -terminated
53*48375Scael.Fa string
54*48375Scaelagainst the compiled regular expression
55*48375Scaelin
56*48375Scael.Fa prog .
5741347SbosticIt returns 1 for success and 0 for failure, and adjusts the contents of
58*48375Scael.Fa prog Ns 's
59*48375Scael.Em startp
60*48375Scaeland
61*48375Scael.Em endp
62*48375Scael(see below) accordingly.
63*48375Scael.Pp
6441347SbosticThe members of a
65*48375Scael.Xr regexp
6641347Sbosticstructure include at least the following (not necessarily in order):
67*48375Scael.Bd -literal -offset indent
6841347Sbosticchar *startp[NSUBEXP];
6941347Sbosticchar *endp[NSUBEXP];
70*48375Scael.Ed
71*48375Scael.Pp
7241347Sbosticwhere
73*48375Scael.Dv NSUBEXP
7441347Sbosticis defined (as 10) in the header file.
75*48375ScaelOnce a successful
76*48375Scael.Fn regexec
77*48375Scaelhas been done using the
78*48375Scael.Fn regexp ,
79*48375Scaeleach
80*48375Scael.Em startp Ns - Em endp
81*48375Scaelpair describes one substring
82*48375Scaelwithin the
83*48375Scael.Fa string ,
84*48375Scaelwith the
85*48375Scael.Em startp
86*48375Scaelpointing to the first character of the substring and
87*48375Scaelthe
88*48375Scael.Em endp
89*48375Scaelpointing to the first character following the substring.
90*48375ScaelThe 0th substring is the substring of
91*48375Scael.Fa string
92*48375Scaelthat matched the whole
9341347Sbosticregular expression.
9441347SbosticThe others are those substrings that matched parenthesized expressions
9541347Sbosticwithin the regular expression, with parenthesized expressions numbered
9641347Sbosticin left-to-right order of their opening parentheses.
97*48375Scael.Pp
98*48375ScaelThe
99*48375Scael.Fn regsub
100*48375Scaelfunction
101*48375Scaelcopies
102*48375Scael.Fa source
103*48375Scaelto
104*48375Scael.Fa dest ,
105*48375Scaelmaking substitutions according to the
106*48375Scaelmost recent
107*48375Scael.Fn regexec
108*48375Scaelperformed using
109*48375Scael.Fa prog .
110*48375ScaelEach instance of `&' in
111*48375Scael.Fa source
112*48375Scaelis replaced by the substring
113*48375Scaelindicated by
114*48375Scael.Em startp Ns Bq
115*48375Scaeland
116*48375Scael.Em endp Ns Bq .
117*48375ScaelEach instance of
118*48375Scael.Sq \e Ns Em n ,
119*48375Scaelwhere
120*48375Scael.Em n
121*48375Scaelis a digit, is replaced by
12241347Sbosticthe substring indicated by
123*48375Scael.Em startp Ns Bq Em n
124*48375Scaeland
125*48375Scael.Em endp Ns Bq Em n .
126*48375ScaelTo get a literal `&' or
127*48375Scael.Sq \e Ns Em n
128*48375Scaelinto
129*48375Scael.Fa dest ,
130*48375Scaelprefix it with `\e';
131*48375Scaelto get a literal `\e' preceding `&' or
132*48375Scael.Sq \e Ns Em n ,
133*48375Scaelprefix it with
13441347Sbosticanother `\e'.
135*48375Scael.Pp
136*48375ScaelThe
137*48375Scael.Fn regerror
138*48375Scaelfunction
139*48375Scaelis called whenever an error is detected in
140*48375Scael.Fn regcomp ,
141*48375Scael.Fn regexec ,
142*48375Scaelor
143*48375Scael.Fn regsub .
144*48375ScaelThe default
145*48375Scael.Fn regerror
146*48375Scaelwrites the string
147*48375Scael.Fa msg ,
14841347Sbosticwith a suitable indicator of origin,
14941347Sbosticon the standard
15041347Sbosticerror output
151*48375Scaeland invokes
152*48375Scael.Xr exit 2 .
153*48375ScaelThe
154*48375Scael.Fn regerror
155*48375Scaelfunction
15641347Sbosticcan be replaced by the user if other actions are desirable.
157*48375Scael.Sh REGULAR EXPRESSION SYNTAX
158*48375ScaelA regular expression is zero or more
159*48375Scael.Em branches ,
160*48375Scaelseparated by `|'.
16141347SbosticIt matches anything that matches one of the branches.
162*48375Scael.Pp
163*48375ScaelA branch is zero or more
164*48375Scael.Em pieces ,
165*48375Scaelconcatenated.
16641347SbosticIt matches a match for the first, followed by a match for the second, etc.
167*48375Scael.Pp
168*48375ScaelA piece is an
169*48375Scael.Em atom
170*48375Scaelpossibly followed by `*', `+', or `?'.
17141347SbosticAn atom followed by `*' matches a sequence of 0 or more matches of the atom.
17241347SbosticAn atom followed by `+' matches a sequence of 1 or more matches of the atom.
17341347SbosticAn atom followed by `?' matches a match of the atom, or the null string.
174*48375Scael.Pp
17541347SbosticAn atom is a regular expression in parentheses (matching a match for the
176*48375Scaelregular expression), a
177*48375Scael.Em range
178*48375Scael(see below), `.'
17941347Sbostic(matching any single character), `^' (matching the null string at the
18041347Sbosticbeginning of the input string), `$' (matching the null string at the
18141347Sbosticend of the input string), a `\e' followed by a single character (matching
18241347Sbosticthat character), or a single character with no other significance
18341347Sbostic(matching that character).
184*48375Scael.Pp
185*48375ScaelA
186*48375Scael.Em range
187*48375Scaelis a sequence of characters enclosed in `[]'.
18841347SbosticIt normally matches any single character from the sequence.
18941347SbosticIf the sequence begins with `^',
190*48375Scaelit matches any single character
191*48375Scael.Em not
192*48375Scaelfrom the rest of the sequence.
19341347SbosticIf two characters in the sequence are separated by `\-', this is shorthand
194*48375Scaelfor the full list of
195*48375Scael.Tn ASCII
196*48375Scaelcharacters between them
19741347Sbostic(e.g. `[0-9]' matches any decimal digit).
19841347SbosticTo include a literal `]' in the sequence, make it the first character
19941347Sbostic(following a possible `^').
20041347SbosticTo include a literal `\-', make it the first or last character.
201*48375Scael.Sh AMBIGUITY
20241347SbosticIf a regular expression could match two different parts of the input string,
20341347Sbosticit will match the one which begins earliest.
204*48375ScaelIf both begin in the same place but match different lengths, or match
20541347Sbosticthe same length in different ways, life gets messier, as follows.
206*48375Scael.Pp
20741347SbosticIn general, the possibilities in a list of branches are considered in
20841347Sbosticleft-to-right order, the possibilities for `*', `+', and `?' are
20941347Sbosticconsidered longest-first, nested constructs are considered from the
21041347Sbosticoutermost in, and concatenated constructs are considered leftmost-first.
21141347SbosticThe match that will be chosen is the one that uses the earliest
21241347Sbosticpossibility in the first choice that has to be made.
21341347SbosticIf there is more than one choice, the next will be made in the same manner
21441347Sbostic(earliest possibility) subject to the decision on the first choice.
21541347SbosticAnd so forth.
216*48375Scael.Pp
217*48375ScaelFor example,
218*48375Scael.Sq Li (ab|a)b*c
219*48375Scaelcould match
220*48375Scael`abc' in one of two ways.
22141347SbosticThe first choice is between `ab' and `a'; since `ab' is earlier, and does
22241347Sbosticlead to a successful overall match, it is chosen.
22341347SbosticSince the `b' is already spoken for,
22441347Sbosticthe `b*' must match its last possibility\(emthe empty string\(emsince
22541347Sbosticit must respect the earlier choice.
226*48375Scael.Pp
22741347SbosticIn the particular case where no `|'s are present and there is only one
22841347Sbostic`*', `+', or `?', the net effect is that the longest possible
22941347Sbosticmatch will be chosen.
230*48375ScaelSo
231*48375Scael.Sq Li ab* ,
232*48375Scaelpresented with `xabbbby', will match `abbbb'.
233*48375ScaelNote that if
234*48375Scael.Sq Li ab* ,
235*48375Scaelis tried against `xabyabbbz', it
23641347Sbosticwill match `ab' just after `x', due to the begins-earliest rule.
23741347Sbostic(In effect, the decision on where to start the match is the first choice
23841347Sbosticto be made, hence subsequent choices must respect it even if this leads them
23941347Sbosticto less-preferred alternatives.)
240*48375Scael.Sh RETURN VALUES
241*48375ScaelThe
242*48375Scael.Fn regcomp
243*48375Scaelfunction
244*48375Scaelreturns
245*48375Scael.Dv NULL
246*48375Scaelfor a failure
247*48375Scael.Pf ( Fn regerror
248*48375Scaelpermitting),
24941347Sbosticwhere failures are syntax errors, exceeding implementation limits,
25041347Sbosticor applying `+' or `*' to a possibly-null operand.
251*48375Scael.Sh SEE ALSO
252*48375Scael.Xr ed 1 ,
253*48375Scael.Xr ex 1 ,
254*48375Scael.Xr expr 1 ,
255*48375Scael.Xr egrep 1 ,
256*48375Scael.Xr fgrep 1 ,
257*48375Scael.Xr grep 1 ,
258*48375Scael.Xr regex 3
259*48375Scael.Sh HISTORY
260*48375ScaelBoth code and manual page for
261*48375Scael.Fn regcomp ,
262*48375Scael.Fn regexec ,
263*48375Scael.Fn regsub ,
264*48375Scaeland
265*48375Scael.Fn regerror
266*48375Scaelwere written at the University of Toronto
267*48375Scaeland appeared in
268*48375Scael.Bx 4.3 tahoe .
269*48375ScaelThey are intended to be compatible with the Bell V8
270*48375Scael.Xr regexp 3 ,
27141347Sbosticbut are not derived from Bell code.
272*48375Scael.Sh BUGS
27341347SbosticEmpty branches and empty regular expressions are not portable to V8.
274*48375Scael.Pp
27541347SbosticThe restriction against
27641347Sbosticapplying `*' or `+' to a possibly-null operand is an artifact of the
27741347Sbosticsimplistic implementation.
278*48375Scael.Pp
279*48375ScaelDoes not support
280*48375Scael.Xr egrep Ns 's
281*48375Scaelnewline-separated branches;
282*48375Scaelneither does the V8
283*48375Scael.Xr regexp 3 ,
284*48375Scaelthough.
285*48375Scael.Pp
28641347SbosticDue to emphasis on
28741347Sbosticcompactness and simplicity,
28841347Sbosticit's not strikingly fast.
28941347SbosticIt does give special attention to handling simple cases quickly.
290