xref: /netbsd-src/lib/libcompat/regexp/regexp.3 (revision d9158b13b5dfe46201430699a3f7a235ecf28df3)
1.\" Copyright (c) 1991, 1993
2.\"	The Regents of the University of California.  All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\" 3. All advertising materials mentioning features or use of this software
13.\"    must display the following acknowledgement:
14.\"	This product includes software developed by the University of
15.\"	California, Berkeley and its contributors.
16.\" 4. Neither the name of the University nor the names of its contributors
17.\"    may be used to endorse or promote products derived from this software
18.\"    without specific prior written permission.
19.\"
20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\"     @(#)regexp.3	8.1 (Berkeley) 6/4/93
33.\"
34.Dd June 4, 1993
35.Dt REGEXP 3
36.Os
37.Sh NAME
38.Nm regcomp ,
39.Nm regexec ,
40.Nm regsub ,
41.Nm regerror
42.Nd regular expression handlers
43.Sh SYNOPSIS
44.Fd #include <regexp.h>
45.Ft regexp *
46.Fn regcomp "const char *exp"
47.Ft int
48.Fn regexec "const regexp *prog" "const char *string"
49.Ft void
50.Fn regsub "const regexp *prog" "const char *source" "char *dest"
51.Sh DESCRIPTION
52This interface is made obsolete by
53.Xr regex 3 .
54.Pp
55The
56.Fn regcomp ,
57.Fn regexec ,
58.Fn regsub ,
59and
60.Fn regerror
61functions
62implement
63.Xr egrep 1 Ns -style
64regular expressions and supporting facilities.
65.Pp
66The
67.Fn regcomp
68function
69compiles a regular expression into a structure of type
70.Xr regexp ,
71and returns a pointer to it.
72The space has been allocated using
73.Xr malloc 3
74and may be released by
75.Xr free .
76.Pp
77The
78.Fn regexec
79function
80matches a
81.Dv NUL Ns -terminated
82.Fa string
83against the compiled regular expression
84in
85.Fa prog .
86It returns 1 for success and 0 for failure, and adjusts the contents of
87.Fa prog Ns 's
88.Em startp
89and
90.Em endp
91(see below) accordingly.
92.Pp
93The members of a
94.Xr regexp
95structure include at least the following (not necessarily in order):
96.Bd -literal -offset indent
97char *startp[NSUBEXP];
98char *endp[NSUBEXP];
99.Ed
100.Pp
101where
102.Dv NSUBEXP
103is defined (as 10) in the header file.
104Once a successful
105.Fn regexec
106has been done using the
107.Fn regexp ,
108each
109.Em startp Ns - Em endp
110pair describes one substring
111within the
112.Fa string ,
113with the
114.Em startp
115pointing to the first character of the substring and
116the
117.Em endp
118pointing to the first character following the substring.
119The 0th substring is the substring of
120.Fa string
121that matched the whole
122regular expression.
123The others are those substrings that matched parenthesized expressions
124within the regular expression, with parenthesized expressions numbered
125in left-to-right order of their opening parentheses.
126.Pp
127The
128.Fn regsub
129function
130copies
131.Fa source
132to
133.Fa dest ,
134making substitutions according to the
135most recent
136.Fn regexec
137performed using
138.Fa prog .
139Each instance of `&' in
140.Fa source
141is replaced by the substring
142indicated by
143.Em startp Ns Bq
144and
145.Em endp Ns Bq .
146Each instance of
147.Sq \e Ns Em n ,
148where
149.Em n
150is a digit, is replaced by
151the substring indicated by
152.Em startp Ns Bq Em n
153and
154.Em endp Ns Bq Em n .
155To get a literal `&' or
156.Sq \e Ns Em n
157into
158.Fa dest ,
159prefix it with `\e';
160to get a literal `\e' preceding `&' or
161.Sq \e Ns Em n ,
162prefix it with
163another `\e'.
164.Pp
165The
166.Fn regerror
167function
168is called whenever an error is detected in
169.Fn regcomp ,
170.Fn regexec ,
171or
172.Fn regsub .
173The default
174.Fn regerror
175writes the string
176.Fa msg ,
177with a suitable indicator of origin,
178on the standard
179error output
180and invokes
181.Xr exit 2 .
182The
183.Fn regerror
184function
185can be replaced by the user if other actions are desirable.
186.Sh REGULAR EXPRESSION SYNTAX
187A regular expression is zero or more
188.Em branches ,
189separated by `|'.
190It matches anything that matches one of the branches.
191.Pp
192A branch is zero or more
193.Em pieces ,
194concatenated.
195It matches a match for the first, followed by a match for the second, etc.
196.Pp
197A piece is an
198.Em atom
199possibly followed by `*', `+', or `?'.
200An atom followed by `*' matches a sequence of 0 or more matches of the atom.
201An atom followed by `+' matches a sequence of 1 or more matches of the atom.
202An atom followed by `?' matches a match of the atom, or the null string.
203.Pp
204An atom is a regular expression in parentheses (matching a match for the
205regular expression), a
206.Em range
207(see below), `.'
208(matching any single character), `^' (matching the null string at the
209beginning of the input string), `$' (matching the null string at the
210end of the input string), a `\e' followed by a single character (matching
211that character), or a single character with no other significance
212(matching that character).
213.Pp
214A
215.Em range
216is a sequence of characters enclosed in `[]'.
217It normally matches any single character from the sequence.
218If the sequence begins with `^',
219it matches any single character
220.Em not
221from the rest of the sequence.
222If two characters in the sequence are separated by `\-', this is shorthand
223for the full list of
224.Tn ASCII
225characters between them
226(e.g. `[0-9]' matches any decimal digit).
227To include a literal `]' in the sequence, make it the first character
228(following a possible `^').
229To include a literal `\-', make it the first or last character.
230.Sh AMBIGUITY
231If a regular expression could match two different parts of the input string,
232it will match the one which begins earliest.
233If both begin in the same place but match different lengths, or match
234the same length in different ways, life gets messier, as follows.
235.Pp
236In general, the possibilities in a list of branches are considered in
237left-to-right order, the possibilities for `*', `+', and `?' are
238considered longest-first, nested constructs are considered from the
239outermost in, and concatenated constructs are considered leftmost-first.
240The match that will be chosen is the one that uses the earliest
241possibility in the first choice that has to be made.
242If there is more than one choice, the next will be made in the same manner
243(earliest possibility) subject to the decision on the first choice.
244And so forth.
245.Pp
246For example,
247.Sq Li (ab|a)b*c
248could match
249`abc' in one of two ways.
250The first choice is between `ab' and `a'; since `ab' is earlier, and does
251lead to a successful overall match, it is chosen.
252Since the `b' is already spoken for,
253the `b*' must match its last possibility\(emthe empty string\(emsince
254it must respect the earlier choice.
255.Pp
256In the particular case where no `|'s are present and there is only one
257`*', `+', or `?', the net effect is that the longest possible
258match will be chosen.
259So
260.Sq Li ab* ,
261presented with `xabbbby', will match `abbbb'.
262Note that if
263.Sq Li ab* ,
264is tried against `xabyabbbz', it
265will match `ab' just after `x', due to the begins-earliest rule.
266(In effect, the decision on where to start the match is the first choice
267to be made, hence subsequent choices must respect it even if this leads them
268to less-preferred alternatives.)
269.Sh RETURN VALUES
270The
271.Fn regcomp
272function
273returns
274.Dv NULL
275for a failure
276.Pf ( Fn regerror
277permitting),
278where failures are syntax errors, exceeding implementation limits,
279or applying `+' or `*' to a possibly-null operand.
280.Sh SEE ALSO
281.Xr ed 1 ,
282.Xr ex 1 ,
283.Xr expr 1 ,
284.Xr egrep 1 ,
285.Xr fgrep 1 ,
286.Xr grep 1 ,
287.Xr regex 3
288.Sh HISTORY
289Both code and manual page for
290.Fn regcomp ,
291.Fn regexec ,
292.Fn regsub ,
293and
294.Fn regerror
295were written at the University of Toronto
296and appeared in
297.Bx 4.3 tahoe .
298They are intended to be compatible with the Bell V8
299.Xr regexp 3 ,
300but are not derived from Bell code.
301.Sh BUGS
302Empty branches and empty regular expressions are not portable to V8.
303.Pp
304The restriction against
305applying `*' or `+' to a possibly-null operand is an artifact of the
306simplistic implementation.
307.Pp
308Does not support
309.Xr egrep Ns 's
310newline-separated branches;
311neither does the V8
312.Xr regexp 3 ,
313though.
314.Pp
315Due to emphasis on
316compactness and simplicity,
317it's not strikingly fast.
318It does give special attention to handling simple cases quickly.
319