xref: /netbsd-src/lib/libcompat/regexp/regexp.3 (revision 2a399c6883d870daece976daec6ffa7bb7f934ce)
1.\" Copyright (c) 1991, 1993
2.\"	The Regents of the University of California.  All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\" 3. All advertising materials mentioning features or use of this software
13.\"    must display the following acknowledgement:
14.\"	This product includes software developed by the University of
15.\"	California, Berkeley and its contributors.
16.\" 4. Neither the name of the University nor the names of its contributors
17.\"    may be used to endorse or promote products derived from this software
18.\"    without specific prior written permission.
19.\"
20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\"     from: @(#)regexp.3	8.1 (Berkeley) 6/4/93
33.\"	$NetBSD: regexp.3,v 1.6 1997/10/09 10:21:14 lukem Exp $
34.\"
35.Dd June 4, 1993
36.Dt REGEXP 3
37.Os
38.Sh NAME
39.Nm regcomp ,
40.Nm regexec ,
41.Nm regsub ,
42.Nm regerror
43.Nd regular expression handlers
44.Sh SYNOPSIS
45.Fd #include <regexp.h>
46.Ft regexp *
47.Fn regcomp "const char *exp"
48.Ft int
49.Fn regexec "const regexp *prog" "const char *string"
50.Ft void
51.Fn regsub "const regexp *prog" "const char *source" "char *dest"
52.Sh DESCRIPTION
53.Bf -symbolic
54This interface is made obsolete by
55.Xr regex 3 .
56It is available from the compatibility library, libcompat.
57.Ef
58.Pp
59The
60.Fn regcomp ,
61.Fn regexec ,
62.Fn regsub ,
63and
64.Fn regerror
65functions
66implement
67.Xr egrep 1 Ns -style
68regular expressions and supporting facilities.
69.Pp
70The
71.Fn regcomp
72function
73compiles a regular expression into a structure of type
74.Xr regexp ,
75and returns a pointer to it.
76The space has been allocated using
77.Xr malloc 3
78and may be released by
79.Xr free .
80.Pp
81The
82.Fn regexec
83function
84matches a
85.Dv NUL Ns -terminated
86.Fa string
87against the compiled regular expression
88in
89.Fa prog .
90It returns 1 for success and 0 for failure, and adjusts the contents of
91.Fa prog Ns 's
92.Em startp
93and
94.Em endp
95(see below) accordingly.
96.Pp
97The members of a
98.Xr regexp
99structure include at least the following (not necessarily in order):
100.Bd -literal -offset indent
101char *startp[NSUBEXP];
102char *endp[NSUBEXP];
103.Ed
104.Pp
105where
106.Dv NSUBEXP
107is defined (as 10) in the header file.
108Once a successful
109.Fn regexec
110has been done using the
111.Fn regexp ,
112each
113.Em startp Ns - Em endp
114pair describes one substring
115within the
116.Fa string ,
117with the
118.Em startp
119pointing to the first character of the substring and
120the
121.Em endp
122pointing to the first character following the substring.
123The 0th substring is the substring of
124.Fa string
125that matched the whole
126regular expression.
127The others are those substrings that matched parenthesized expressions
128within the regular expression, with parenthesized expressions numbered
129in left-to-right order of their opening parentheses.
130.Pp
131The
132.Fn regsub
133function
134copies
135.Fa source
136to
137.Fa dest ,
138making substitutions according to the
139most recent
140.Fn regexec
141performed using
142.Fa prog .
143Each instance of `&' in
144.Fa source
145is replaced by the substring
146indicated by
147.Em startp Ns Bq
148and
149.Em endp Ns Bq .
150Each instance of
151.Sq \e Ns Em n ,
152where
153.Em n
154is a digit, is replaced by
155the substring indicated by
156.Em startp Ns Bq Em n
157and
158.Em endp Ns Bq Em n .
159To get a literal `&' or
160.Sq \e Ns Em n
161into
162.Fa dest ,
163prefix it with `\e';
164to get a literal `\e' preceding `&' or
165.Sq \e Ns Em n ,
166prefix it with
167another `\e'.
168.Pp
169The
170.Fn regerror
171function
172is called whenever an error is detected in
173.Fn regcomp ,
174.Fn regexec ,
175or
176.Fn regsub .
177The default
178.Fn regerror
179writes the string
180.Fa msg ,
181with a suitable indicator of origin,
182on the standard
183error output
184and invokes
185.Xr exit 3 .
186The
187.Fn regerror
188function
189can be replaced by the user if other actions are desirable.
190.Sh REGULAR EXPRESSION SYNTAX
191A regular expression is zero or more
192.Em branches ,
193separated by `|'.
194It matches anything that matches one of the branches.
195.Pp
196A branch is zero or more
197.Em pieces ,
198concatenated.
199It matches a match for the first, followed by a match for the second, etc.
200.Pp
201A piece is an
202.Em atom
203possibly followed by `*', `+', or `?'.
204An atom followed by `*' matches a sequence of 0 or more matches of the atom.
205An atom followed by `+' matches a sequence of 1 or more matches of the atom.
206An atom followed by `?' matches a match of the atom, or the null string.
207.Pp
208An atom is a regular expression in parentheses (matching a match for the
209regular expression), a
210.Em range
211(see below), `.'
212(matching any single character), `^' (matching the null string at the
213beginning of the input string), `$' (matching the null string at the
214end of the input string), a `\e' followed by a single character (matching
215that character), or a single character with no other significance
216(matching that character).
217.Pp
218A
219.Em range
220is a sequence of characters enclosed in `[]'.
221It normally matches any single character from the sequence.
222If the sequence begins with `^',
223it matches any single character
224.Em not
225from the rest of the sequence.
226If two characters in the sequence are separated by `\-', this is shorthand
227for the full list of
228.Tn ASCII
229characters between them
230(e.g. `[0-9]' matches any decimal digit).
231To include a literal `]' in the sequence, make it the first character
232(following a possible `^').
233To include a literal `\-', make it the first or last character.
234.Sh AMBIGUITY
235If a regular expression could match two different parts of the input string,
236it will match the one which begins earliest.
237If both begin in the same place but match different lengths, or match
238the same length in different ways, life gets messier, as follows.
239.Pp
240In general, the possibilities in a list of branches are considered in
241left-to-right order, the possibilities for `*', `+', and `?' are
242considered longest-first, nested constructs are considered from the
243outermost in, and concatenated constructs are considered leftmost-first.
244The match that will be chosen is the one that uses the earliest
245possibility in the first choice that has to be made.
246If there is more than one choice, the next will be made in the same manner
247(earliest possibility) subject to the decision on the first choice.
248And so forth.
249.Pp
250For example,
251.Sq Li (ab|a)b*c
252could match
253`abc' in one of two ways.
254The first choice is between `ab' and `a'; since `ab' is earlier, and does
255lead to a successful overall match, it is chosen.
256Since the `b' is already spoken for,
257the `b*' must match its last possibility\(emthe empty string\(emsince
258it must respect the earlier choice.
259.Pp
260In the particular case where no `|'s are present and there is only one
261`*', `+', or `?', the net effect is that the longest possible
262match will be chosen.
263So
264.Sq Li ab* ,
265presented with `xabbbby', will match `abbbb'.
266Note that if
267.Sq Li ab* ,
268is tried against `xabyabbbz', it
269will match `ab' just after `x', due to the begins-earliest rule.
270(In effect, the decision on where to start the match is the first choice
271to be made, hence subsequent choices must respect it even if this leads them
272to less-preferred alternatives.)
273.Sh RETURN VALUES
274The
275.Fn regcomp
276function
277returns
278.Dv NULL
279for a failure
280.Pf ( Fn regerror
281permitting),
282where failures are syntax errors, exceeding implementation limits,
283or applying `+' or `*' to a possibly-null operand.
284.Sh SEE ALSO
285.Xr ed 1 ,
286.Xr ex 1 ,
287.Xr expr 1 ,
288.Xr egrep 1 ,
289.Xr fgrep 1 ,
290.Xr grep 1 ,
291.Xr regex 3
292.Sh HISTORY
293Both code and manual page for
294.Fn regcomp ,
295.Fn regexec ,
296.Fn regsub ,
297and
298.Fn regerror
299were written at the University of Toronto
300and appeared in
301.Bx 4.3 tahoe .
302They are intended to be compatible with the Bell V8
303.Xr regexp 3 ,
304but are not derived from Bell code.
305.Sh BUGS
306Empty branches and empty regular expressions are not portable to V8.
307.Pp
308The restriction against
309applying `*' or `+' to a possibly-null operand is an artifact of the
310simplistic implementation.
311.Pp
312Does not support
313.Xr egrep Ns 's
314newline-separated branches;
315neither does the V8
316.Xr regexp 3 ,
317though.
318.Pp
319Due to emphasis on
320compactness and simplicity,
321it's not strikingly fast.
322It does give special attention to handling simple cases quickly.
323