xref: /netbsd-src/lib/libc/regex/regex.3 (revision 5e4c038a45edbc7d63b7c2daa76e29f88b64a4e3)
1.\"	$NetBSD: regex.3,v 1.13 2002/04/29 01:41:44 simonb Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
4.\" Copyright (c) 1992, 1993, 1994
5.\"	The Regents of the University of California.  All rights reserved.
6.\"
7.\" This code is derived from software contributed to Berkeley by
8.\" Henry Spencer.
9.\"
10.\" Redistribution and use in source and binary forms, with or without
11.\" modification, are permitted provided that the following conditions
12.\" are met:
13.\" 1. Redistributions of source code must retain the above copyright
14.\"    notice, this list of conditions and the following disclaimer.
15.\" 2. Redistributions in binary form must reproduce the above copyright
16.\"    notice, this list of conditions and the following disclaimer in the
17.\"    documentation and/or other materials provided with the distribution.
18.\" 3. All advertising materials mentioning features or use of this software
19.\"    must display the following acknowledgement:
20.\"	This product includes software developed by the University of
21.\"	California, Berkeley and its contributors.
22.\" 4. Neither the name of the University nor the names of its contributors
23.\"    may be used to endorse or promote products derived from this software
24.\"    without specific prior written permission.
25.\"
26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
29.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
36.\" SUCH DAMAGE.
37.\"
38.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
39.\"
40.Dd March 20, 1994
41.Dt REGEX 3
42.Os
43.Sh NAME
44.Nm regex ,
45.Nm regcomp ,
46.Nm regexec ,
47.Nm regerror ,
48.Nm regfree
49.Nd regular-expression library
50.Sh LIBRARY
51.Lb libc
52.Sh SYNOPSIS
53.Fd #include \*[Lt]regex.h\*[Gt]
54.Ft int
55.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
56.Ft int
57.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
58.Ft size_t
59.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size"
60.Ft void
61.Fn regfree "regex_t *preg"
62.Sh DESCRIPTION
63These routines implement
64.St -p1003.2-92
65regular expressions (``RE''s);
66see
67.Xr re_format 7 .
68.Fn regcomp
69compiles an RE written as a string into an internal form,
70.Fn regexec
71matches that internal form against a string and reports results,
72.Fn regerror
73transforms error codes from either into human-readable messages,
74and
75.Fn regfree
76frees any dynamically-allocated storage used by the internal form
77of an RE.
78.Pp
79The header
80.Em \*[Lt]regex.h\*[Gt]
81declares two structure types,
82.Fa regex_t
83and
84.Fa regmatch_t ,
85the former for compiled internal forms and the latter for match reporting.
86It also declares the four functions,
87a type
88.Fa regoff_t ,
89and a number of constants with names starting with ``REG_''.
90.Pp
91.Fn regcomp
92compiles the regular expression contained in the
93.Fa pattern
94string,
95subject to the flags in
96.Fa cflags ,
97and places the results in the
98.Fa regex_t
99structure pointed to by
100.Fa preg .
101.Fa cflags
102is the bitwise OR of zero or more of the following flags:
103.Bl -tag -width XXXREG_EXTENDED
104.It Dv REG_EXTENDED
105Compile modern (``extended'') REs, rather than the obsolete
106(``basic'') REs that are the default.
107.It Dv REG_BASIC
108This is a synonym for 0,
109provided as a counterpart to REG_EXTENDED to improve readability.
110.It Dv REG_NOSPEC
111Compile with recognition of all special characters turned off.  All
112characters are thus considered ordinary, so the ``RE'' is a literal
113string.
114This is an extension, compatible with but not specified by
115.St -p1003.2-92 ,
116and should be used with caution in software intended to be portable to
117other systems.
118.Dv REG_EXTENDED
119and
120.Dv REG_NOSPEC
121may not be used in the same call to
122.Fn regcomp .
123.It Dv REG_ICASE
124Compile for matching that ignores upper/lower case distinctions. See
125.Xr re_format 7 .
126.It Dv REG_NOSUB
127Compile for matching that need only report success or failure, not
128what was matched.
129.It Dv REG_NEWLINE
130Compile for newline-sensitive matching.
131By default, newline is a completely ordinary character with no special
132meaning in either REs or strings.
133With this flag,
134`[^' bracket expressions and `.' never match newline,
135a `^' anchor matches the null string after any newline in the string
136in addition to its normal function,
137and the `$' anchor matches the null string before any newline in the
138string in addition to its normal function.
139.It Dv REG_PEND
140The regular expression ends, not at the first NUL, but just before the
141character pointed to by the
142.Fa re_endp
143member of the structure pointed to by
144.Fa preg .
145The
146.Fa re_endp
147member is of type
148.Fa "const\ char\ *" .
149This flag permits inclusion of NULs in the RE; they are considered
150ordinary characters.
151This is an extension, compatible with but not specified by
152.St -p1003.2-92 ,
153and should be used with caution in software intended to be portable to
154other systems.
155.El
156.Pp
157When successful,
158.Fn regcomp
159returns 0 and fills in the structure pointed to by
160.Fa preg .
161One member of that structure (other than
162.Fa re_endp )
163is publicized:
164.Fa re_nsub ,
165of type
166.Fa size_t ,
167contains the number of parenthesized subexpressions within the RE
168(except that the value of this member is undefined if the
169.Dv REG_NOSUB
170flag was used).
171If
172.Fn regcomp
173fails, it returns a non-zero error code;
174see
175.Sx DIAGNOSTICS .
176.Pp
177.Fn regexec
178matches the compiled RE pointed to by
179.Fa preg
180against the
181.Fa string ,
182subject to the flags in
183.Fa eflags ,
184and reports results using
185.Fa nmatch ,
186.Fa pmatch ,
187and the returned value.
188The RE must have been compiled by a previous invocation of
189.Fn regcomp .
190The compiled form is not altered during execution of
191.Fn regexec ,
192so a single compiled RE can be used simultaneously by multiple threads.
193.Pp
194By default,
195the NUL-terminated string pointed to by
196.Fa string
197is considered to be the text of an entire line, minus any terminating
198newline.
199The
200.Fa eflags
201argument is the bitwise OR of zero or more of the following flags:
202.Bl -tag -width XXXREG_NOTBOL
203.It Dv REG_NOTBOL
204The first character of the string
205is not the beginning of a line, so the `^' anchor should not match before it.
206This does not affect the behavior of newlines under
207.Dv REG_NEWLINE .
208.It Dv REG_NOTEOL
209The NUL terminating the string does not end a line, so the `$' anchor
210should not match before it.  This does not affect the behavior of
211newlines under
212.Dv REG_NEWLINE .
213.It Dv REG_STARTEND
214The string is considered to start at
215.Fa string
216+
217.Fa pmatch[0].rm_so
218and to have a terminating NUL located at
219.Fa string
220+
221.Fa pmatch[0].rm_eo
222(there need not actually be a NUL at that location),
223regardless of the value of
224.Fa nmatch .
225See below for the definition of
226.Fa pmatch
227and
228.Fa nmatch .
229This is an extension, compatible with but not specified by
230.St -p1003.2-92 ,
231and should be used with caution in software intended to be portable to
232other systems.
233Note that a non-zero
234.Fa rm_so
235does not imply
236.Dv REG_NOTBOL ;
237.Dv REG_STARTEND
238affects only the location of the string, not how it is matched.
239.El
240.Pp
241See
242.Xr re_format 7
243for a discussion of what is matched in situations where an RE or a
244portion thereof could match any of several substrings of
245.Fa string .
246.Pp
247Normally,
248.Fn regexec
249returns 0 for success and the non-zero code
250.Dv REG_NOMATCH
251for failure.
252Other non-zero error codes may be returned in exceptional situations;
253see
254.Sx DIAGNOSTICS .
255.Pp
256If
257.Dv REG_NOSUB
258was specified in the compilation of the RE, or if
259.Fa nmatch
260is 0,
261.Fn regexec
262ignores the
263.Fa pmatch
264argument (but see below for the case where
265.Dv REG_STARTEND
266is specified).
267Otherwise,
268.Fa pmatch
269points to an array of
270.Fa nmatch
271structures of type
272.Fa regmatch_t .
273Such a structure has at least the members
274.Fa rm_so
275and
276.Fa rm_eo ,
277both of type
278.Fa regoff_t
279(a signed arithmetic type at least as large as an
280.Fa off_t
281and a
282.Fa ssize_t ) ,
283containing respectively the offset of the first character of a substring
284and the offset of the first character after the end of the substring.
285Offsets are measured from the beginning of the
286.Fa string
287argument given to
288.Fn regexec .
289An empty substring is denoted by equal offsets,
290both indicating the character following the empty substring.
291.Pp
292The 0th member of the
293.Fa pmatch
294array is filled in to indicate what substring of
295.Fa string
296was matched by the entire RE.
297Remaining members report what substring was matched by parenthesized
298subexpressions within the RE;
299member
300.Fa i
301reports subexpression
302.Fa i ,
303with subexpressions counted (starting at 1) by the order of their
304opening parentheses in the RE, left to right.
305Unused entries in the array\(emcorresponding either to subexpressions that
306did not participate in the match at all, or to subexpressions that do not
307exist in the RE (that is,
308.Fa i
309\*[Gt]
310.Fa preg-\*[Gt]re_nsub )
311\(emhave both
312.Fa rm_so
313and
314.Fa rm_eo
315set to -1.
316If a subexpression participated in the match several times,
317the reported substring is the last one it matched.
318(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
319the parenthesized subexpression matches each of the three `b's and then
320an infinite number of empty strings following the last `b',
321so the reported substring is one of the empties.)
322.Pp
323If
324.Dv REG_STARTEND
325is specified,
326.Fa pmatch
327must point to at least one
328.Fa regmatch_t
329(even if
330.Fa nmatch
331is 0 or
332.Dv REG_NOSUB
333was specified),
334to hold the input offsets for
335.Dv REG_STARTEND .
336Use for output is still entirely controlled by
337.Fa nmatch ;
338if
339.Fa nmatch
340is 0 or
341.Dv REG_NOSUB
342was specified,
343the value of
344.Fa pmatch [0]
345will not be changed by a successful
346.Fn regexec .
347.Pp
348.Fn regerror
349maps a non-zero
350.Fa errcode
351from either
352.Fn regcomp
353or
354.Fn regexec
355to a human-readable, printable message.
356If
357.Fa preg
358is non-NULL,
359the error code should have arisen from use of the
360.Fa regex_t
361pointed to by
362.Fa preg ,
363and if the error code came from
364.Fn regcomp ,
365it should have been the result from the most recent
366.Fn regcomp
367using that
368.Fa regex_t . (
369.Fn regerror
370may be able to supply a more detailed message using information
371from the
372.Fa regex_t . )
373.Fn regerror
374places the NUL-terminated message into the buffer pointed to by
375.Fa errbuf ,
376limiting the length (including the NUL) to at most
377.Fa errbuf_size
378bytes.
379If the whole message won't fit,
380as much of it as will fit before the terminating NUL is supplied.
381In any case,
382the returned value is the size of buffer needed to hold the whole
383message (including terminating NUL).
384If
385.Fa errbuf_size
386is 0,
387.Fa errbuf
388is ignored but the return value is still correct.
389.Pp
390If the
391.Fa errcode
392given to
393.Fn regerror
394is first ORed with
395.Dv REG_ITOA ,
396the ``message'' that results is the printable name of the error code,
397e.g. ``REG_NOMATCH'',
398rather than an explanation thereof.
399If
400.Fa errcode
401is
402.Dv REG_ATOI ,
403then
404.Fa preg
405shall be non-NULL and the
406.Fa re_endp
407member of the structure it points to
408must point to the printable name of an error code;
409in this case, the result in
410.Fa errbuf
411is the decimal digits of
412the numeric value of the error code
413(0 if the name is not recognized).
414.Dv REG_ITOA
415and
416.Dv REG_ATOI
417are intended primarily as debugging facilities;
418they are extensions, compatible with but not specified by
419.St -p1003.2-92 ,
420and should be used with caution in software intended to be portable to
421other systems.
422Be warned also that they are considered experimental and changes are possible.
423.Pp
424.Fn regfree
425frees any dynamically-allocated storage associated with the compiled RE
426pointed to by
427.Fa preg .
428The remaining
429.Fa regex_t
430is no longer a valid compiled RE
431and the effect of supplying it to
432.Fn regexec
433or
434.Fn regerror
435is undefined.
436.Pp
437None of these functions references global variables except for tables
438of constants;
439all are safe for use from multiple threads if the arguments are safe.
440.Sh IMPLEMENTATION CHOICES
441There are a number of decisions that
442.St -p1003.2-92
443leaves up to the implementor,
444either by explicitly saying ``undefined'' or by virtue of them being
445forbidden by the RE grammar.
446This implementation treats them as follows.
447.Pp
448See
449.Xr re_format 7
450for a discussion of the definition of case-independent matching.
451.Pp
452There is no particular limit on the length of REs,
453except insofar as memory is limited.
454Memory usage is approximately linear in RE size, and largely insensitive
455to RE complexity, except for bounded repetitions.
456See BUGS for one short RE using them
457that will run almost any system out of memory.
458.Pp
459A backslashed character other than one specifically given a magic meaning
460by
461.St -p1003.2-92
462(such magic meanings occur only in obsolete [``basic''] REs)
463is taken as an ordinary character.
464.Pp
465Any unmatched [ is a
466.Dv REG_EBRACK
467error.
468.Pp
469Equivalence classes cannot begin or end bracket-expression ranges.
470The endpoint of one range cannot begin another.
471.Pp
472.Dv RE_DUP_MAX ,
473the limit on repetition counts in bounded repetitions, is 255.
474.Pp
475A repetition operator (?, *, +, or bounds) cannot follow another
476repetition operator.
477A repetition operator cannot begin an expression or subexpression
478or follow `^' or `|'.
479.Pp
480`|' cannot appear first or last in a (sub)expression or after another `|',
481i.e. an operand of `|' cannot be an empty subexpression.
482An empty parenthesized subexpression, `()', is legal and matches an
483empty (sub)string.
484An empty string is not a legal RE.
485.Pp
486A `{' followed by a digit is considered the beginning of bounds for a
487bounded repetition, which must then follow the syntax for bounds.
488A `{' \fInot\fR followed by a digit is considered an ordinary character.
489.Pp
490`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
491REs are anchors, not ordinary characters.
492.Sh DIAGNOSTICS
493Non-zero error codes from
494.Fn regcomp
495and
496.Fn regexec
497include the following:
498.Pp
499.Bl -tag -width XXXREG_ECOLLATE -compact
500.It Dv REG_NOMATCH
501regexec() failed to match
502.It Dv REG_BADPAT
503invalid regular expression
504.It Dv REG_ECOLLATE
505invalid collating element
506.It Dv REG_ECTYPE
507invalid character class
508.It Dv REG_EESCAPE
509\e applied to unescapable character
510.It Dv REG_ESUBREG
511invalid backreference number
512.It Dv REG_EBRACK
513brackets [ ] not balanced
514.It Dv REG_EPAREN
515parentheses ( ) not balanced
516.It Dv REG_EBRACE
517braces { } not balanced
518.It Dv REG_BADBR
519invalid repetition count(s) in { }
520.It Dv REG_ERANGE
521invalid character range in [ ]
522.It Dv REG_ESPACE
523ran out of memory
524.It Dv REG_BADRPT
525?, *, or + operand invalid
526.It Dv REG_EMPTY
527empty (sub)expression
528.It Dv REG_ASSERT
529``can't happen''\(emyou found a bug
530.It Dv REG_INVARG
531invalid argument, e.g. negative-length string
532.El
533.Sh SEE ALSO
534.Xr grep 1 ,
535.Xr sed 1 ,
536.Xr re_format 7
537.Pp
538.St -p1003.2-92 ,
539sections 2.8 (Regular Expression Notation)
540and
541B.5 (C Binding for Regular Expression Matching).
542.Sh HISTORY
543Originally written by Henry Spencer.
544Altered for inclusion in the
545.Bx 4.4
546distribution.
547.Sh BUGS
548This is an alpha release with known defects.
549Please report problems.
550.Pp
551There is one known functionality bug.
552The implementation of internationalization is incomplete:
553the locale is always assumed to be the default one of
554.St -p1003.2-92 ,
555and only the collating elements etc. of that locale are available.
556.Pp
557The back-reference code is subtle and doubts linger about its correctness
558in complex cases.
559.Pp
560.Fn regexec
561performance is poor.
562This will improve with later releases.
563.Fa nmatch
564exceeding 0 is expensive;
565.Fa nmatch
566exceeding 1 is worse.
567.Fa regexec
568is largely insensitive to RE complexity
569.Em except
570that back references are massively expensive.
571RE length does matter; in particular, there is a strong speed bonus
572for keeping RE length under about 30 characters,
573with most special characters counting roughly double.
574.Pp
575.Fn regcomp
576implements bounded repetitions by macro expansion,
577which is costly in time and space if counts are large
578or bounded repetitions are nested.
579An RE like, say,
580`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
581will (eventually) run almost any existing machine out of swap space.
582.Pp
583There are suspected problems with response to obscure error conditions.
584Notably,
585certain kinds of internal overflow,
586produced only by truly enormous REs or by multiply nested bounded repetitions,
587are probably not handled well.
588.Pp
589Due to a mistake in
590.St -p1003.2-92 ,
591things like `a)b' are legal REs because `)' is a special character
592only in the presence of a previous unmatched `('.  This can't be fixed
593until the spec is fixed.
594.Pp
595The standard's definition of back references is vague.
596For example, does
597`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
598Until the standard is clarified, behavior in such cases should not be
599relied on.
600.Pp
601The implementation of word-boundary matching is a bit of a kludge,
602and bugs may lurk in combinations of word-boundary matching and anchoring.
603