xref: /netbsd-src/lib/libcompat/regexp/regexp.3 (revision dc306354b0b29af51801a7632f1e95265a68cd81)
1.\" Copyright (c) 1991, 1993
2.\"	The Regents of the University of California.  All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\" 3. All advertising materials mentioning features or use of this software
13.\"    must display the following acknowledgement:
14.\"	This product includes software developed by the University of
15.\"	California, Berkeley and its contributors.
16.\" 4. Neither the name of the University nor the names of its contributors
17.\"    may be used to endorse or promote products derived from this software
18.\"    without specific prior written permission.
19.\"
20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\"     from: @(#)regexp.3	8.1 (Berkeley) 6/4/93
33.\"	$NetBSD: regexp.3,v 1.9 1998/04/29 19:25:25 fair Exp $
34.\"
35.Dd June 4, 1993
36.Dt REGEXP 3
37.Os
38.Sh NAME
39.Nm regcomp ,
40.Nm regexec ,
41.Nm regsub ,
42.Nm regerror
43.Nd regular expression handlers
44.Sh LIBRARY
45.Lb libcompat
46.Sh SYNOPSIS
47.Fd #include <regexp.h>
48.Ft regexp *
49.Fn regcomp "const char *exp"
50.Ft int
51.Fn regexec "const regexp *prog" "const char *string"
52.Ft void
53.Fn regsub "const regexp *prog" "const char *source" "char *dest"
54.Sh DESCRIPTION
55.Bf -symbolic
56This interface is made obsolete by
57.Xr regex 3 .
58It is available from the compatibility library, libcompat.
59.Ef
60.Pp
61The
62.Fn regcomp ,
63.Fn regexec ,
64.Fn regsub ,
65and
66.Fn regerror
67functions implement
68.Xr egrep 1 Ns -style
69regular expressions and supporting facilities.
70.Pp
71The
72.Fn regcomp
73function
74compiles a regular expression into a structure of type
75.Em regexp ,
76and returns a pointer to it.
77The space has been allocated using
78.Xr malloc 3
79and may be released by
80.Xr free 3 .
81.Pp
82The
83.Fn regexec
84function
85matches a
86.Dv NUL Ns -terminated
87.Fa string
88against the compiled regular expression
89in
90.Fa prog .
91It returns 1 for success and 0 for failure, and adjusts the contents of
92.Fa prog Ns 's
93.Em startp
94and
95.Em endp
96(see below) accordingly.
97.Pp
98The members of a
99.Em regexp
100structure include at least the following (not necessarily in order):
101.Bd -literal -offset indent
102char *startp[NSUBEXP];
103char *endp[NSUBEXP];
104.Ed
105.Pp
106where
107.Dv NSUBEXP
108is defined (as 10) in the header file.
109Once a successful
110.Fn regexec
111has been done using the
112.Fn regexp ,
113each
114.Em startp Ns - Em endp
115pair describes one substring
116within the
117.Fa string ,
118with the
119.Em startp
120pointing to the first character of the substring and
121the
122.Em endp
123pointing to the first character following the substring.
124The 0th substring is the substring of
125.Fa string
126that matched the whole
127regular expression.
128The others are those substrings that matched parenthesized expressions
129within the regular expression, with parenthesized expressions numbered
130in left-to-right order of their opening parentheses.
131.Pp
132The
133.Fn regsub
134function
135copies
136.Fa source
137to
138.Fa dest ,
139making substitutions according to the
140most recent
141.Fn regexec
142performed using
143.Fa prog .
144Each instance of `&' in
145.Fa source
146is replaced by the substring
147indicated by
148.Em startp Ns Bq
149and
150.Em endp Ns Bq .
151Each instance of
152.Sq \e Ns Em n ,
153where
154.Em n
155is a digit, is replaced by
156the substring indicated by
157.Em startp Ns Bq Em n
158and
159.Em endp Ns Bq Em n .
160To get a literal `&' or
161.Sq \e Ns Em n
162into
163.Fa dest ,
164prefix it with `\e';
165to get a literal `\e' preceding `&' or
166.Sq \e Ns Em n ,
167prefix it with
168another `\e'.
169.Pp
170The
171.Fn regerror
172function
173is called whenever an error is detected in
174.Fn regcomp ,
175.Fn regexec ,
176or
177.Fn regsub .
178The default
179.Fn regerror
180writes the string
181.Fa msg ,
182with a suitable indicator of origin,
183on the standard
184error output
185and invokes
186.Xr exit 3 .
187The
188.Fn regerror
189function
190can be replaced by the user if other actions are desirable.
191.Sh REGULAR EXPRESSION SYNTAX
192A regular expression is zero or more
193.Em branches ,
194separated by `|'.
195It matches anything that matches one of the branches.
196.Pp
197A branch is zero or more
198.Em pieces ,
199concatenated.
200It matches a match for the first, followed by a match for the second, etc.
201.Pp
202A piece is an
203.Em atom
204possibly followed by `*', `+', or `?'.
205An atom followed by `*' matches a sequence of 0 or more matches of the atom.
206An atom followed by `+' matches a sequence of 1 or more matches of the atom.
207An atom followed by `?' matches a match of the atom, or the null string.
208.Pp
209An atom is a regular expression in parentheses (matching a match for the
210regular expression), a
211.Em range
212(see below), `.'
213(matching any single character), `^' (matching the null string at the
214beginning of the input string), `$' (matching the null string at the
215end of the input string), a `\e' followed by a single character (matching
216that character), or a single character with no other significance
217(matching that character).
218.Pp
219A
220.Em range
221is a sequence of characters enclosed in `[]'.
222It normally matches any single character from the sequence.
223If the sequence begins with `^',
224it matches any single character
225.Em not
226from the rest of the sequence.
227If two characters in the sequence are separated by `\-', this is shorthand
228for the full list of
229.Tn ASCII
230characters between them
231(e.g. `[0-9]' matches any decimal digit).
232To include a literal `]' in the sequence, make it the first character
233(following a possible `^').
234To include a literal `\-', make it the first or last character.
235.Sh AMBIGUITY
236If a regular expression could match two different parts of the input string,
237it will match the one which begins earliest.
238If both begin in the same place but match different lengths, or match
239the same length in different ways, life gets messier, as follows.
240.Pp
241In general, the possibilities in a list of branches are considered in
242left-to-right order, the possibilities for `*', `+', and `?' are
243considered longest-first, nested constructs are considered from the
244outermost in, and concatenated constructs are considered leftmost-first.
245The match that will be chosen is the one that uses the earliest
246possibility in the first choice that has to be made.
247If there is more than one choice, the next will be made in the same manner
248(earliest possibility) subject to the decision on the first choice.
249And so forth.
250.Pp
251For example,
252.Sq Li (ab|a)b*c
253could match
254`abc' in one of two ways.
255The first choice is between `ab' and `a'; since `ab' is earlier, and does
256lead to a successful overall match, it is chosen.
257Since the `b' is already spoken for,
258the `b*' must match its last possibility\(emthe empty string\(emsince
259it must respect the earlier choice.
260.Pp
261In the particular case where no `|'s are present and there is only one
262`*', `+', or `?', the net effect is that the longest possible
263match will be chosen.
264So
265.Sq Li ab* ,
266presented with `xabbbby', will match `abbbb'.
267Note that if
268.Sq Li ab* ,
269is tried against `xabyabbbz', it
270will match `ab' just after `x', due to the begins-earliest rule.
271(In effect, the decision on where to start the match is the first choice
272to be made, hence subsequent choices must respect it even if this leads them
273to less-preferred alternatives.)
274.Sh RETURN VALUES
275The
276.Fn regcomp
277function
278returns
279.Dv NULL
280for a failure
281.Pf ( Fn regerror
282permitting),
283where failures are syntax errors, exceeding implementation limits,
284or applying `+' or `*' to a possibly-null operand.
285.Sh SEE ALSO
286.Xr ed 1 ,
287.Xr ex 1 ,
288.Xr expr 1 ,
289.Xr egrep 1 ,
290.Xr fgrep 1 ,
291.Xr grep 1 ,
292.Xr regex 3
293.Sh HISTORY
294Both code and manual page for
295.Fn regcomp ,
296.Fn regexec ,
297.Fn regsub ,
298and
299.Fn regerror
300were written at the University of Toronto
301and appeared in
302.Bx 4.3 tahoe .
303They are intended to be compatible with the Bell V8
304.Xr regexp 3 ,
305but are not derived from Bell code.
306.Sh BUGS
307Empty branches and empty regular expressions are not portable to V8.
308.Pp
309The restriction against
310applying `*' or `+' to a possibly-null operand is an artifact of the
311simplistic implementation.
312.Pp
313Does not support
314.Xr egrep 1 Ns 's
315newline-separated branches;
316neither does the V8
317.Xr regexp 3 ,
318though.
319.Pp
320Due to emphasis on
321compactness and simplicity,
322it's not strikingly fast.
323It does give special attention to handling simple cases quickly.
324