xref: /netbsd-src/lib/libc/regex/regex.3 (revision ce83c69e00a77ed28196913aa6cd4d90b18a5a40)
1.\"	$NetBSD: regex.3,v 1.9 1998/08/29 08:32:34 lukem Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
4.\" Copyright (c) 1992, 1993, 1994
5.\"	The Regents of the University of California.  All rights reserved.
6.\"
7.\" This code is derived from software contributed to Berkeley by
8.\" Henry Spencer.
9.\"
10.\" Redistribution and use in source and binary forms, with or without
11.\" modification, are permitted provided that the following conditions
12.\" are met:
13.\" 1. Redistributions of source code must retain the above copyright
14.\"    notice, this list of conditions and the following disclaimer.
15.\" 2. Redistributions in binary form must reproduce the above copyright
16.\"    notice, this list of conditions and the following disclaimer in the
17.\"    documentation and/or other materials provided with the distribution.
18.\" 3. All advertising materials mentioning features or use of this software
19.\"    must display the following acknowledgement:
20.\"	This product includes software developed by the University of
21.\"	California, Berkeley and its contributors.
22.\" 4. Neither the name of the University nor the names of its contributors
23.\"    may be used to endorse or promote products derived from this software
24.\"    without specific prior written permission.
25.\"
26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
29.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
36.\" SUCH DAMAGE.
37.\"
38.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
39.\"
40.Dd March 20, 1994
41.Dt REGEX 3
42.Os
43.Sh NAME
44.Nm regex ,
45.Nm regcomp ,
46.Nm regexec ,
47.Nm regerror ,
48.Nm regfree
49.Nd regular-expression library
50.Sh LIBRARY
51.Lb libc
52.Sh SYNOPSIS
53.Fd #include <sys/types.h>
54.Fd #include <regex.h>
55.Ft int
56.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
57.Ft int
58.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
59.Ft size_t
60.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size"
61.Ft void
62.Fn regfree "regex_t *preg"
63.Sh DESCRIPTION
64These routines implement
65.St -p1003.2-92
66regular expressions (``RE''s);
67see
68.Xr re_format 7 .
69.Fn regcomp
70compiles an RE written as a string into an internal form,
71.Fn regexec
72matches that internal form against a string and reports results,
73.Fn regerror
74transforms error codes from either into human-readable messages,
75and
76.Fn regfree
77frees any dynamically-allocated storage used by the internal form
78of an RE.
79.Pp
80The header
81.Em <regex.h>
82declares two structure types,
83.Fa regex_t
84and
85.Fa regmatch_t ,
86the former for compiled internal forms and the latter for match reporting.
87It also declares the four functions,
88a type
89.Fa regoff_t ,
90and a number of constants with names starting with ``REG_''.
91.Pp
92.Fn regcomp
93compiles the regular expression contained in the
94.Fa pattern
95string,
96subject to the flags in
97.Fa cflags ,
98and places the results in the
99.Fa regex_t
100structure pointed to by
101.Fa preg .
102.Fa cflags
103is the bitwise OR of zero or more of the following flags:
104.Bl -tag -width XXXREG_EXTENDED
105.It Dv REG_EXTENDED
106Compile modern (``extended'') REs, rather than the obsolete
107(``basic'') REs that are the default.
108.It Dv REG_BASIC
109This is a synonym for 0,
110provided as a counterpart to REG_EXTENDED to improve readability.
111.It Dv REG_NOSPEC
112Compile with recognition of all special characters turned off.  All
113characters are thus considered ordinary, so the ``RE'' is a literal
114string.
115This is an extension, compatible with but not specified by
116.St -p1003.2-92 ,
117and should be used with caution in software intended to be portable to
118other systems.
119.Dv REG_EXTENDED
120and
121.Dv REG_NOSPEC
122may not be used in the same call to
123.Fn regcomp .
124.It Dv REG_ICASE
125Compile for matching that ignores upper/lower case distinctions. See
126.Xr re_format 7 .
127.It Dv REG_NOSUB
128Compile for matching that need only report success or failure, not
129what was matched.
130.It Dv REG_NEWLINE
131Compile for newline-sensitive matching.
132By default, newline is a completely ordinary character with no special
133meaning in either REs or strings.
134With this flag,
135`[^' bracket expressions and `.' never match newline,
136a `^' anchor matches the null string after any newline in the string
137in addition to its normal function,
138and the `$' anchor matches the null string before any newline in the
139string in addition to its normal function.
140.It Dv REG_PEND
141The regular expression ends, not at the first NUL, but just before the
142character pointed to by the
143.Fa re_endp
144member of the structure pointed to by
145.Fa preg .
146The
147.Fa re_endp
148member is of type
149.Fa "const\ char\ *" .
150This flag permits inclusion of NULs in the RE; they are considered
151ordinary characters.
152This is an extension, compatible with but not specified by
153.St -p1003.2-92 ,
154and should be used with caution in software intended to be portable to
155other systems.
156.El
157.Pp
158When successful,
159.Fn regcomp
160returns 0 and fills in the structure pointed to by
161.Fa preg .
162One member of that structure (other than
163.Fa re_endp )
164is publicized:
165.Fa re_nsub ,
166of type
167.Fa size_t ,
168contains the number of parenthesized subexpressions within the RE
169(except that the value of this member is undefined if the
170.Dv REG_NOSUB
171flag was used).
172If
173.Fn regcomp
174fails, it returns a non-zero error code;
175see DIAGNOSTICS.
176.Pp
177.Fn regexec
178matches the compiled RE pointed to by
179.Fa preg
180against the
181.Fa string ,
182subject to the flags in
183.Fa eflags ,
184and reports results using
185.Fa nmatch ,
186.Fa pmatch ,
187and the returned value.
188The RE must have been compiled by a previous invocation of
189.Fn regcomp .
190The compiled form is not altered during execution of
191.Fn regexec ,
192so a single compiled RE can be used simultaneously by multiple threads.
193.Pp
194By default,
195the NUL-terminated string pointed to by
196.Fa string
197is considered to be the text of an entire line, minus any terminating
198newline.
199The
200.Fa eflags
201argument is the bitwise OR of zero or more of the following flags:
202.Bl -tag -width XXXREG_NOTBOL
203.It Dv REG_NOTBOL
204The first character of the string
205is not the beginning of a line, so the `^' anchor should not match before it.
206This does not affect the behavior of newlines under
207.Dv REG_NEWLINE .
208.It Dv REG_NOTEOL
209The NUL terminating the string does not end a line, so the `$' anchor
210should not match before it.  This does not affect the behavior of
211newlines under
212.Dv REG_NEWLINE .
213.It Dv REG_STARTEND
214The string is considered to start at
215.Fa string
216+
217.Fa pmatch[0].rm_so
218and to have a terminating NUL located at
219.Fa string
220+
221.Fa pmatch[0].rm_eo
222(there need not actually be a NUL at that location),
223regardless of the value of
224.Fa nmatch .
225See below for the definition of
226.Fa pmatch
227and
228.Fa nmatch .
229This is an extension, compatible with but not specified by
230.St -p1003.2-92 ,
231and should be used with caution in software intended to be portable to
232other systems.
233Note that a non-zero
234.Fa rm_so
235does not imply
236.Dv REG_NOTBOL ;
237.Dv REG_STARTEND
238affects only the location of the string, not how it is matched.
239.El
240.Pp
241See
242.Xr re_format 7
243for a discussion of what is matched in situations where an RE or a
244portion thereof could match any of several substrings of
245.Fa string .
246.Pp
247Normally,
248.Fn regexec
249returns 0 for success and the non-zero code
250.Dv REG_NOMATCH
251for failure.
252Other non-zero error codes may be returned in exceptional situations;
253see DIAGNOSTICS.
254.Pp
255If
256.Dv REG_NOSUB
257was specified in the compilation of the RE, or if
258.Fa nmatch
259is 0,
260.Fn regexec
261ignores the
262.Fa pmatch
263argument (but see below for the case where
264.Dv REG_STARTEND
265is specified).
266Otherwise,
267.Fa pmatch
268points to an array of
269.Fa nmatch
270structures of type
271.Fa regmatch_t .
272Such a structure has at least the members
273.Fa rm_so
274and
275.Fa rm_eo ,
276both of type
277.Fa regoff_t
278(a signed arithmetic type at least as large as an
279.Fa off_t
280and a
281.Fa ssize_t ) ,
282containing respectively the offset of the first character of a substring
283and the offset of the first character after the end of the substring.
284Offsets are measured from the beginning of the
285.Fa string
286argument given to
287.Fn regexec .
288An empty substring is denoted by equal offsets,
289both indicating the character following the empty substring.
290.Pp
291The 0th member of the
292.Fa pmatch
293array is filled in to indicate what substring of
294.Fa string
295was matched by the entire RE.
296Remaining members report what substring was matched by parenthesized
297subexpressions within the RE;
298member
299.Fa i
300reports subexpression
301.Fa i ,
302with subexpressions counted (starting at 1) by the order of their
303opening parentheses in the RE, left to right.
304Unused entries in the array\(emcorresponding either to subexpressions that
305did not participate in the match at all, or to subexpressions that do not
306exist in the RE (that is,
307.Fa i
308>
309.Fa preg->re_nsub )
310\(emhave both
311.Fa rm_so
312and
313.Fa rm_eo
314set to -1.
315If a subexpression participated in the match several times,
316the reported substring is the last one it matched.
317(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
318the parenthesized subexpression matches each of the three `b's and then
319an infinite number of empty strings following the last `b',
320so the reported substring is one of the empties.)
321.Pp
322If
323.Dv REG_STARTEND
324is specified,
325.Fa pmatch
326must point to at least one
327.Fa regmatch_t
328(even if
329.Fa nmatch
330is 0 or
331.Dv REG_NOSUB
332was specified),
333to hold the input offsets for
334.Dv REG_STARTEND .
335Use for output is still entirely controlled by
336.Fa nmatch ;
337if
338.Fa nmatch
339is 0 or
340.Dv REG_NOSUB
341was specified,
342the value of
343.Fa pmatch [0]
344will not be changed by a successful
345.Fn regexec .
346.Pp
347.Fn regerror
348maps a non-zero
349.Fa errcode
350from either
351.Fn regcomp
352or
353.Fn regexec
354to a human-readable, printable message.
355If
356.Fa preg
357is non-NULL,
358the error code should have arisen from use of the
359.Fa regex_t
360pointed to by
361.Fa preg ,
362and if the error code came from
363.Fn regcomp ,
364it should have been the result from the most recent
365.Fn regcomp
366using that
367.Fa regex_t . (
368.Fn regerror
369may be able to supply a more detailed message using information
370from the
371.Fa regex_t . )
372.Fn regerror
373places the NUL-terminated message into the buffer pointed to by
374.Fa errbuf ,
375limiting the length (including the NUL) to at most
376.Fa errbuf_size
377bytes.
378If the whole message won't fit,
379as much of it as will fit before the terminating NUL is supplied.
380In any case,
381the returned value is the size of buffer needed to hold the whole
382message (including terminating NUL).
383If
384.Fa errbuf_size
385is 0,
386.Fa errbuf
387is ignored but the return value is still correct.
388.Pp
389If the
390.Fa errcode
391given to
392.Fn regerror
393is first ORed with
394.Dv REG_ITOA ,
395the ``message'' that results is the printable name of the error code,
396e.g. ``REG_NOMATCH'',
397rather than an explanation thereof.
398If
399.Fa errcode
400.Dv REG_ATOI ,
401then
402.Fa preg
403shall be non-NULL and the
404.Fa re_endp
405member of the structure it points to
406must point to the printable name of an error code;
407in this case, the result in
408.Fa errbuf
409is the decimal digits of
410the numeric value of the error code
411(0 if the name is not recognized).
412.Dv REG_ITOA
413and
414.Dv REG_ATOI
415are intended primarily as debugging facilities;
416they are extensions, compatible with but not specified by
417.St -p1003.2-92 ,
418and should be used with caution in software intended to be portable to
419other systems.
420Be warned also that they are considered experimental and changes are possible.
421.Pp
422.Fn regfree
423frees any dynamically-allocated storage associated with the compiled RE
424pointed to by
425.Fa preg .
426The remaining
427.Fa regex_t
428is no longer a valid compiled RE
429and the effect of supplying it to
430.Fn regexec
431or
432.Fn regerror
433is undefined.
434.Pp
435None of these functions references global variables except for tables
436of constants;
437all are safe for use from multiple threads if the arguments are safe.
438.Sh IMPLEMENTATION CHOICES
439There are a number of decisions that
440.St -p1003.2-92
441leaves up to the implementor,
442either by explicitly saying ``undefined'' or by virtue of them being
443forbidden by the RE grammar.
444This implementation treats them as follows.
445.Pp
446See
447.Xr re_format 7
448for a discussion of the definition of case-independent matching.
449.Pp
450There is no particular limit on the length of REs,
451except insofar as memory is limited.
452Memory usage is approximately linear in RE size, and largely insensitive
453to RE complexity, except for bounded repetitions.
454See BUGS for one short RE using them
455that will run almost any system out of memory.
456.Pp
457A backslashed character other than one specifically given a magic meaning
458by
459.St -p1003.2-92
460(such magic meanings occur only in obsolete [``basic''] REs)
461is taken as an ordinary character.
462.Pp
463Any unmatched [ is a
464.Dv REG_EBRACK
465error.
466.Pp
467Equivalence classes cannot begin or end bracket-expression ranges.
468The endpoint of one range cannot begin another.
469.Pp
470.Dv RE_DUP_MAX ,
471the limit on repetition counts in bounded repetitions, is 255.
472.Pp
473A repetition operator (?, *, +, or bounds) cannot follow another
474repetition operator.
475A repetition operator cannot begin an expression or subexpression
476or follow `^' or `|'.
477.Pp
478`|' cannot appear first or last in a (sub)expression or after another `|',
479i.e. an operand of `|' cannot be an empty subexpression.
480An empty parenthesized subexpression, `()', is legal and matches an
481empty (sub)string.
482An empty string is not a legal RE.
483.Pp
484A `{' followed by a digit is considered the beginning of bounds for a
485bounded repetition, which must then follow the syntax for bounds.
486A `{' \fInot\fR followed by a digit is considered an ordinary character.
487.Pp
488`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
489REs are anchors, not ordinary characters.
490.Sh SEE ALSO
491.Xr grep 1 ,
492.Xr sed 1 ,
493.Xr re_format 7
494.Pp
495.St -p1003.2-92 ,
496sections 2.8 (Regular Expression Notation)
497and
498B.5 (C Binding for Regular Expression Matching).
499.Sh DIAGNOSTICS
500Non-zero error codes from
501.Fn regcomp
502and
503.Fn regexec
504include the following:
505.Pp
506.Bl -tag -width XXXREG_ECOLLATE -compact
507.It Dv REG_NOMATCH
508regexec() failed to match
509.It Dv REG_BADPAT
510invalid regular expression
511.It Dv REG_ECOLLATE
512invalid collating element
513.It Dv REG_ECTYPE
514invalid character class
515.It Dv REG_EESCAPE
516\e applied to unescapable character
517.It Dv REG_ESUBREG
518invalid backreference number
519.It Dv REG_EBRACK
520brackets [ ] not balanced
521.It Dv REG_EPAREN
522parentheses ( ) not balanced
523.It Dv REG_EBRACE
524braces { } not balanced
525.It Dv REG_BADBR
526invalid repetition count(s) in { }
527.It Dv REG_ERANGE
528invalid character range in [ ]
529.It Dv REG_ESPACE
530ran out of memory
531.It Dv REG_BADRPT
532?, *, or + operand invalid
533.It Dv REG_EMPTY
534empty (sub)expression
535.It Dv REG_ASSERT
536``can't happen''\(emyou found a bug
537.It Dv REG_INVARG
538invalid argument, e.g. negative-length string
539.El
540.Sh HISTORY
541Originally written by Henry Spencer.
542Altered for inclusion in the
543.Bx 4.4
544distribution.
545.Sh BUGS
546This is an alpha release with known defects.
547Please report problems.
548.Pp
549There is one known functionality bug.
550The implementation of internationalization is incomplete:
551the locale is always assumed to be the default one of
552.St -p1003.2-92 ,
553and only the collating elements etc. of that locale are available.
554.Pp
555The back-reference code is subtle and doubts linger about its correctness
556in complex cases.
557.Pp
558.Fn regexec
559performance is poor.
560This will improve with later releases.
561.Fa nmatch
562exceeding 0 is expensive;
563.Fa nmatch
564exceeding 1 is worse.
565.Fa regexec
566is largely insensitive to RE complexity
567.Em except
568that back references are massively expensive.
569RE length does matter; in particular, there is a strong speed bonus
570for keeping RE length under about 30 characters,
571with most special characters counting roughly double.
572.Pp
573.Fn regcomp
574implements bounded repetitions by macro expansion,
575which is costly in time and space if counts are large
576or bounded repetitions are nested.
577An RE like, say,
578`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
579will (eventually) run almost any existing machine out of swap space.
580.Pp
581There are suspected problems with response to obscure error conditions.
582Notably,
583certain kinds of internal overflow,
584produced only by truly enormous REs or by multiply nested bounded repetitions,
585are probably not handled well.
586.Pp
587Due to a mistake in
588.St -p1003.2-92 ,
589things like `a)b' are legal REs because `)' is a special character
590only in the presence of a previous unmatched `('.  This can't be fixed
591until the spec is fixed.
592.Pp
593The standard's definition of back references is vague.
594For example, does
595`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
596Until the standard is clarified, behavior in such cases should not be
597relied on.
598.Pp
599The implementation of word-boundary matching is a bit of a kludge,
600and bugs may lurk in combinations of word-boundary matching and anchoring.
601