xref: /netbsd-src/lib/libcompat/regexp/regexp.3 (revision 7c7c171d130af9949261bc7dce2150a03c3d239c)
1.\" Copyright (c) 1991, 1993
2.\"	The Regents of the University of California.  All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\" 3. All advertising materials mentioning features or use of this software
13.\"    must display the following acknowledgement:
14.\"	This product includes software developed by the University of
15.\"	California, Berkeley and its contributors.
16.\" 4. Neither the name of the University nor the names of its contributors
17.\"    may be used to endorse or promote products derived from this software
18.\"    without specific prior written permission.
19.\"
20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\"     from: @(#)regexp.3	8.1 (Berkeley) 6/4/93
33.\"	$NetBSD: regexp.3,v 1.7 1998/02/05 18:51:31 perry Exp $
34.\"
35.Dd June 4, 1993
36.Dt REGEXP 3
37.Os
38.Sh NAME
39.Nm regcomp ,
40.Nm regexec ,
41.Nm regsub ,
42.Nm regerror
43.Nd regular expression handlers
44.Sh LIBRARY
45.Lb libcompat
46.Sh SYNOPSIS
47.Fd #include <regexp.h>
48.Ft regexp *
49.Fn regcomp "const char *exp"
50.Ft int
51.Fn regexec "const regexp *prog" "const char *string"
52.Ft void
53.Fn regsub "const regexp *prog" "const char *source" "char *dest"
54.Sh DESCRIPTION
55.Bf -symbolic
56This interface is made obsolete by
57.Xr regex 3 .
58It is available from the compatibility library, libcompat.
59.Ef
60.Pp
61The
62.Fn regcomp ,
63.Fn regexec ,
64.Fn regsub ,
65and
66.Fn regerror
67functions
68implement
69.Xr egrep 1 Ns -style
70regular expressions and supporting facilities.
71.Pp
72The
73.Fn regcomp
74function
75compiles a regular expression into a structure of type
76.Xr regexp ,
77and returns a pointer to it.
78The space has been allocated using
79.Xr malloc 3
80and may be released by
81.Xr free .
82.Pp
83The
84.Fn regexec
85function
86matches a
87.Dv NUL Ns -terminated
88.Fa string
89against the compiled regular expression
90in
91.Fa prog .
92It returns 1 for success and 0 for failure, and adjusts the contents of
93.Fa prog Ns 's
94.Em startp
95and
96.Em endp
97(see below) accordingly.
98.Pp
99The members of a
100.Xr regexp
101structure include at least the following (not necessarily in order):
102.Bd -literal -offset indent
103char *startp[NSUBEXP];
104char *endp[NSUBEXP];
105.Ed
106.Pp
107where
108.Dv NSUBEXP
109is defined (as 10) in the header file.
110Once a successful
111.Fn regexec
112has been done using the
113.Fn regexp ,
114each
115.Em startp Ns - Em endp
116pair describes one substring
117within the
118.Fa string ,
119with the
120.Em startp
121pointing to the first character of the substring and
122the
123.Em endp
124pointing to the first character following the substring.
125The 0th substring is the substring of
126.Fa string
127that matched the whole
128regular expression.
129The others are those substrings that matched parenthesized expressions
130within the regular expression, with parenthesized expressions numbered
131in left-to-right order of their opening parentheses.
132.Pp
133The
134.Fn regsub
135function
136copies
137.Fa source
138to
139.Fa dest ,
140making substitutions according to the
141most recent
142.Fn regexec
143performed using
144.Fa prog .
145Each instance of `&' in
146.Fa source
147is replaced by the substring
148indicated by
149.Em startp Ns Bq
150and
151.Em endp Ns Bq .
152Each instance of
153.Sq \e Ns Em n ,
154where
155.Em n
156is a digit, is replaced by
157the substring indicated by
158.Em startp Ns Bq Em n
159and
160.Em endp Ns Bq Em n .
161To get a literal `&' or
162.Sq \e Ns Em n
163into
164.Fa dest ,
165prefix it with `\e';
166to get a literal `\e' preceding `&' or
167.Sq \e Ns Em n ,
168prefix it with
169another `\e'.
170.Pp
171The
172.Fn regerror
173function
174is called whenever an error is detected in
175.Fn regcomp ,
176.Fn regexec ,
177or
178.Fn regsub .
179The default
180.Fn regerror
181writes the string
182.Fa msg ,
183with a suitable indicator of origin,
184on the standard
185error output
186and invokes
187.Xr exit 3 .
188The
189.Fn regerror
190function
191can be replaced by the user if other actions are desirable.
192.Sh REGULAR EXPRESSION SYNTAX
193A regular expression is zero or more
194.Em branches ,
195separated by `|'.
196It matches anything that matches one of the branches.
197.Pp
198A branch is zero or more
199.Em pieces ,
200concatenated.
201It matches a match for the first, followed by a match for the second, etc.
202.Pp
203A piece is an
204.Em atom
205possibly followed by `*', `+', or `?'.
206An atom followed by `*' matches a sequence of 0 or more matches of the atom.
207An atom followed by `+' matches a sequence of 1 or more matches of the atom.
208An atom followed by `?' matches a match of the atom, or the null string.
209.Pp
210An atom is a regular expression in parentheses (matching a match for the
211regular expression), a
212.Em range
213(see below), `.'
214(matching any single character), `^' (matching the null string at the
215beginning of the input string), `$' (matching the null string at the
216end of the input string), a `\e' followed by a single character (matching
217that character), or a single character with no other significance
218(matching that character).
219.Pp
220A
221.Em range
222is a sequence of characters enclosed in `[]'.
223It normally matches any single character from the sequence.
224If the sequence begins with `^',
225it matches any single character
226.Em not
227from the rest of the sequence.
228If two characters in the sequence are separated by `\-', this is shorthand
229for the full list of
230.Tn ASCII
231characters between them
232(e.g. `[0-9]' matches any decimal digit).
233To include a literal `]' in the sequence, make it the first character
234(following a possible `^').
235To include a literal `\-', make it the first or last character.
236.Sh AMBIGUITY
237If a regular expression could match two different parts of the input string,
238it will match the one which begins earliest.
239If both begin in the same place but match different lengths, or match
240the same length in different ways, life gets messier, as follows.
241.Pp
242In general, the possibilities in a list of branches are considered in
243left-to-right order, the possibilities for `*', `+', and `?' are
244considered longest-first, nested constructs are considered from the
245outermost in, and concatenated constructs are considered leftmost-first.
246The match that will be chosen is the one that uses the earliest
247possibility in the first choice that has to be made.
248If there is more than one choice, the next will be made in the same manner
249(earliest possibility) subject to the decision on the first choice.
250And so forth.
251.Pp
252For example,
253.Sq Li (ab|a)b*c
254could match
255`abc' in one of two ways.
256The first choice is between `ab' and `a'; since `ab' is earlier, and does
257lead to a successful overall match, it is chosen.
258Since the `b' is already spoken for,
259the `b*' must match its last possibility\(emthe empty string\(emsince
260it must respect the earlier choice.
261.Pp
262In the particular case where no `|'s are present and there is only one
263`*', `+', or `?', the net effect is that the longest possible
264match will be chosen.
265So
266.Sq Li ab* ,
267presented with `xabbbby', will match `abbbb'.
268Note that if
269.Sq Li ab* ,
270is tried against `xabyabbbz', it
271will match `ab' just after `x', due to the begins-earliest rule.
272(In effect, the decision on where to start the match is the first choice
273to be made, hence subsequent choices must respect it even if this leads them
274to less-preferred alternatives.)
275.Sh RETURN VALUES
276The
277.Fn regcomp
278function
279returns
280.Dv NULL
281for a failure
282.Pf ( Fn regerror
283permitting),
284where failures are syntax errors, exceeding implementation limits,
285or applying `+' or `*' to a possibly-null operand.
286.Sh SEE ALSO
287.Xr ed 1 ,
288.Xr ex 1 ,
289.Xr expr 1 ,
290.Xr egrep 1 ,
291.Xr fgrep 1 ,
292.Xr grep 1 ,
293.Xr regex 3
294.Sh HISTORY
295Both code and manual page for
296.Fn regcomp ,
297.Fn regexec ,
298.Fn regsub ,
299and
300.Fn regerror
301were written at the University of Toronto
302and appeared in
303.Bx 4.3 tahoe .
304They are intended to be compatible with the Bell V8
305.Xr regexp 3 ,
306but are not derived from Bell code.
307.Sh BUGS
308Empty branches and empty regular expressions are not portable to V8.
309.Pp
310The restriction against
311applying `*' or `+' to a possibly-null operand is an artifact of the
312simplistic implementation.
313.Pp
314Does not support
315.Xr egrep Ns 's
316newline-separated branches;
317neither does the V8
318.Xr regexp 3 ,
319though.
320.Pp
321Due to emphasis on
322compactness and simplicity,
323it's not strikingly fast.
324It does give special attention to handling simple cases quickly.
325