xref: /netbsd-src/lib/libc/regex/regex.3 (revision dc1e66a8e42afc4d90a37464e0108641f5596be9)
1.\"	$NetBSD: regex.3,v 1.6 1998/02/05 07:28:03 perry Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
4.\" Copyright (c) 1992, 1993, 1994
5.\"	The Regents of the University of California.  All rights reserved.
6.\"
7.\" This code is derived from software contributed to Berkeley by
8.\" Henry Spencer.
9.\"
10.\" Redistribution and use in source and binary forms, with or without
11.\" modification, are permitted provided that the following conditions
12.\" are met:
13.\" 1. Redistributions of source code must retain the above copyright
14.\"    notice, this list of conditions and the following disclaimer.
15.\" 2. Redistributions in binary form must reproduce the above copyright
16.\"    notice, this list of conditions and the following disclaimer in the
17.\"    documentation and/or other materials provided with the distribution.
18.\" 3. All advertising materials mentioning features or use of this software
19.\"    must display the following acknowledgement:
20.\"	This product includes software developed by the University of
21.\"	California, Berkeley and its contributors.
22.\" 4. Neither the name of the University nor the names of its contributors
23.\"    may be used to endorse or promote products derived from this software
24.\"    without specific prior written permission.
25.\"
26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
29.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
36.\" SUCH DAMAGE.
37.\"
38.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
39.\"
40.Dd March 20, 1994
41.Dt REGEX 3
42.Os
43.Sh NAME
44.Nm regex ,
45.Nm regcomp ,
46.Nm regexec ,
47.Nm regerror ,
48.Nm regfree
49.Nd regular-expression library
50.Sh SYNOPSIS
51.Fd #include <sys/types.h>
52.Fd #include <regex.h>
53.Ft int
54.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
55.Ft int
56.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
57.Ft size_t
58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size"
59.Ft void
60.Fn regfree "regex_t *preg"
61.Sh DESCRIPTION
62These routines implement
63.St -p1003.2-92
64regular expressions (``RE''s);
65see
66.Xr re_format 7 .
67.Fn regcomp
68compiles an RE written as a string into an internal form,
69.Fn regexec
70matches that internal form against a string and reports results,
71.Fn regerror
72transforms error codes from either into human-readable messages,
73and
74.Fn regfree
75frees any dynamically-allocated storage used by the internal form
76of an RE.
77.Pp
78The header
79.Em <regex.h>
80declares two structure types,
81.Fa regex_t
82and
83.Fa regmatch_t ,
84the former for compiled internal forms and the latter for match reporting.
85It also declares the four functions,
86a type
87.Fa regoff_t ,
88and a number of constants with names starting with ``REG_''.
89.Pp
90.Fn Regcomp
91compiles the regular expression contained in the
92.Fa pattern
93string,
94subject to the flags in
95.Fa cflags ,
96and places the results in the
97.Fa regex_t
98structure pointed to by
99.Fa preg .
100.Fa Cflags
101is the bitwise OR of zero or more of the following flags:
102.Bl -tag -width XXXREG_EXTENDED
103.It Dv REG_EXTENDED
104Compile modern (``extended'') REs, rather than the obsolete
105(``basic'') REs that are the default.
106.It Dv REG_BASIC
107This is a synonym for 0,
108provided as a counterpart to REG_EXTENDED to improve readability.
109.It Dv REG_NOSPEC
110Compile with recognition of all special characters turned off.  All
111characters are thus considered ordinary, so the ``RE'' is a literal
112string.
113This is an extension, compatible with but not specified by
114.St -p1003.2-92 ,
115and should be used with caution in software intended to be portable to
116other systems.
117.Dv REG_EXTENDED
118and
119.Dv REG_NOSPEC
120may not be used in the same call to
121.Fn regcomp .
122.It Dv REG_ICASE
123Compile for matching that ignores upper/lower case distinctions. See
124.Xr re_format 7 .
125.It Dv REG_NOSUB
126Compile for matching that need only report success or failure, not
127what was matched.
128.It Dv REG_NEWLINE
129Compile for newline-sensitive matching.
130By default, newline is a completely ordinary character with no special
131meaning in either REs or strings.
132With this flag,
133`[^' bracket expressions and `.' never match newline,
134a `^' anchor matches the null string after any newline in the string
135in addition to its normal function,
136and the `$' anchor matches the null string before any newline in the
137string in addition to its normal function.
138.It Dv REG_PEND
139The regular expression ends, not at the first NUL, but just before the
140character pointed to by the
141.Fa re_endp
142member of the structure pointed to by
143.Fa preg .
144The
145.Fa re_endp
146member is of type
147.Fa "const\ char\ *" .
148This flag permits inclusion of NULs in the RE; they are considered
149ordinary characters.
150This is an extension, compatible with but not specified by
151.St -p1003.2-92 ,
152and should be used with caution in software intended to be portable to
153other systems.
154.El
155.Pp
156When successful,
157.Fn regcomp
158returns 0 and fills in the structure pointed to by
159.Fa preg .
160One member of that structure (other than
161.Fa re_endp )
162is publicized:
163.Fa re_nsub ,
164of type
165.Fa size_t ,
166contains the number of parenthesized subexpressions within the RE
167(except that the value of this member is undefined if the
168.Dv REG_NOSUB
169flag was used).
170If
171.Fn regcomp
172fails, it returns a non-zero error code;
173see DIAGNOSTICS.
174.Pp
175.Fn Regexec
176matches the compiled RE pointed to by
177.Fa preg
178against the
179.Fa string ,
180subject to the flags in
181.Fa eflags ,
182and reports results using
183.Fa nmatch ,
184.Fa pmatch ,
185and the returned value.
186The RE must have been compiled by a previous invocation of
187.Fn regcomp .
188The compiled form is not altered during execution of
189.Fn regexec ,
190so a single compiled RE can be used simultaneously by multiple threads.
191.Pp
192By default,
193the NUL-terminated string pointed to by
194.Fa string
195is considered to be the text of an entire line, minus any terminating
196newline.
197The
198.Fa eflags
199argument is the bitwise OR of zero or more of the following flags:
200.Bl -tag -width XXXREG_NOTBOL
201.It Dv REG_NOTBOL
202The first character of the string
203is not the beginning of a line, so the `^' anchor should not match before it.
204This does not affect the behavior of newlines under
205.Dv REG_NEWLINE .
206.It Dv REG_NOTEOL
207The NUL terminating the string does not end a line, so the `$' anchor
208should not match before it.  This does not affect the behavior of
209newlines under
210.Dv REG_NEWLINE .
211.It Dv REG_STARTEND
212The string is considered to start at
213.Fa string
214+
215.Fa pmatch[0].rm_so
216and to have a terminating NUL located at
217.Fa string
218+
219.Fa pmatch[0].rm_eo
220(there need not actually be a NUL at that location),
221regardless of the value of
222.Fa nmatch .
223See below for the definition of
224.Fa pmatch
225and
226.Fa nmatch .
227This is an extension, compatible with but not specified by
228.St -p1003.2-92 ,
229and should be used with caution in software intended to be portable to
230other systems.
231Note that a non-zero
232.Fa rm_so
233does not imply
234.Dv REG_NOTBOL ;
235.Dv REG_STARTEND
236affects only the location of the string, not how it is matched.
237.El
238.Pp
239See
240.Xr re_format 7
241for a discussion of what is matched in situations where an RE or a
242portion thereof could match any of several substrings of
243.Fa string .
244.Pp
245Normally,
246.Fn regexec
247returns 0 for success and the non-zero code
248.Dv REG_NOMATCH
249for failure.
250Other non-zero error codes may be returned in exceptional situations;
251see DIAGNOSTICS.
252.Pp
253If
254.Dv REG_NOSUB
255was specified in the compilation of the RE, or if
256.Fa nmatch
257is 0,
258.Fn regexec
259ignores the
260.Fa pmatch
261argument (but see below for the case where
262.Dv REG_STARTEND
263is specified).
264Otherwise,
265.Fa pmatch
266points to an array of
267.Fa nmatch
268structures of type
269.Fa regmatch_t .
270Such a structure has at least the members
271.Fa rm_so
272and
273.Fa rm_eo ,
274both of type
275.Fa regoff_t
276(a signed arithmetic type at least as large as an
277.Fa off_t
278and a
279.Fa ssize_t ) ,
280containing respectively the offset of the first character of a substring
281and the offset of the first character after the end of the substring.
282Offsets are measured from the beginning of the
283.Fa string
284argument given to
285.Fn regexec .
286An empty substring is denoted by equal offsets,
287both indicating the character following the empty substring.
288.Pp
289The 0th member of the
290.Fa pmatch
291array is filled in to indicate what substring of
292.Fa string
293was matched by the entire RE.
294Remaining members report what substring was matched by parenthesized
295subexpressions within the RE;
296member
297.Fa i
298reports subexpression
299.Fa i ,
300with subexpressions counted (starting at 1) by the order of their
301opening parentheses in the RE, left to right.
302Unused entries in the array\(emcorresponding either to subexpressions that
303did not participate in the match at all, or to subexpressions that do not
304exist in the RE (that is,
305.Fa i
306>
307.Fa preg->re_nsub )
308\(emhave both
309.Fa rm_so
310and
311.Fa rm_eo
312set to -1.
313If a subexpression participated in the match several times,
314the reported substring is the last one it matched.
315(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
316the parenthesized subexpression matches each of the three `b's and then
317an infinite number of empty strings following the last `b',
318so the reported substring is one of the empties.)
319.Pp
320If
321.Dv REG_STARTEND
322is specified,
323.Fa pmatch
324must point to at least one
325.Fa regmatch_t
326(even if
327.Fa nmatch
328is 0 or
329.Dv REG_NOSUB
330was specified),
331to hold the input offsets for
332.Dv REG_STARTEND .
333Use for output is still entirely controlled by
334.Fa nmatch ;
335if
336.Fa nmatch
337is 0 or
338.Dv REG_NOSUB
339was specified,
340the value of
341.Fa pmatch [0]
342will not be changed by a successful
343.Fn regexec .
344.Pp
345.Fn Regerror
346maps a non-zero
347.Fa errcode
348from either
349.Fn regcomp
350or
351.Fn regexec
352to a human-readable, printable message.
353If
354.Fa preg
355is non-NULL,
356the error code should have arisen from use of the
357.Fa regex_t
358pointed to by
359.Fa preg ,
360and if the error code came from
361.Fn regcomp ,
362it should have been the result from the most recent
363.Fn regcomp
364using that
365.Fa regex_t . (
366.Fn Regerror
367may be able to supply a more detailed message using information
368from the
369.Fa regex_t . )
370.Fn Regerror
371places the NUL-terminated message into the buffer pointed to by
372.Fa errbuf ,
373limiting the length (including the NUL) to at most
374.Fa errbuf_size
375bytes.
376If the whole message won't fit,
377as much of it as will fit before the terminating NUL is supplied.
378In any case,
379the returned value is the size of buffer needed to hold the whole
380message (including terminating NUL).
381If
382.Fa errbuf_size
383is 0,
384.Fa errbuf
385is ignored but the return value is still correct.
386.Pp
387If the
388.Fa errcode
389given to
390.Fn regerror
391is first ORed with
392.Dv REG_ITOA ,
393the ``message'' that results is the printable name of the error code,
394e.g. ``REG_NOMATCH'',
395rather than an explanation thereof.
396If
397.Fa errcode
398.Dv REG_ATOI ,
399then
400.Fa preg
401shall be non-NULL and the
402.Fa re_endp
403member of the structure it points to
404must point to the printable name of an error code;
405in this case, the result in
406.Fa errbuf
407is the decimal digits of
408the numeric value of the error code
409(0 if the name is not recognized).
410.Dv REG_ITOA
411and
412.Dv REG_ATOI
413are intended primarily as debugging facilities;
414they are extensions, compatible with but not specified by
415.St -p1003.2-92 ,
416and should be used with caution in software intended to be portable to
417other systems.
418Be warned also that they are considered experimental and changes are possible.
419.Pp
420.Fn Regfree
421frees any dynamically-allocated storage associated with the compiled RE
422pointed to by
423.Fa preg .
424The remaining
425.Fa regex_t
426is no longer a valid compiled RE
427and the effect of supplying it to
428.Fn regexec
429or
430.Fn regerror
431is undefined.
432.Pp
433None of these functions references global variables except for tables
434of constants;
435all are safe for use from multiple threads if the arguments are safe.
436.Sh IMPLEMENTATION CHOICES
437There are a number of decisions that
438.St -p1003.2-92
439leaves up to the implementor,
440either by explicitly saying ``undefined'' or by virtue of them being
441forbidden by the RE grammar.
442This implementation treats them as follows.
443.Pp
444See
445.Xr re_format 7
446for a discussion of the definition of case-independent matching.
447.Pp
448There is no particular limit on the length of REs,
449except insofar as memory is limited.
450Memory usage is approximately linear in RE size, and largely insensitive
451to RE complexity, except for bounded repetitions.
452See BUGS for one short RE using them
453that will run almost any system out of memory.
454.Pp
455A backslashed character other than one specifically given a magic meaning
456by
457.St -p1003.2-92
458(such magic meanings occur only in obsolete [``basic''] REs)
459is taken as an ordinary character.
460.Pp
461Any unmatched [ is a
462.Dv REG_EBRACK
463error.
464.Pp
465Equivalence classes cannot begin or end bracket-expression ranges.
466The endpoint of one range cannot begin another.
467.Pp
468.Dv RE_DUP_MAX ,
469the limit on repetition counts in bounded repetitions, is 255.
470.Pp
471A repetition operator (?, *, +, or bounds) cannot follow another
472repetition operator.
473A repetition operator cannot begin an expression or subexpression
474or follow `^' or `|'.
475.Pp
476`|' cannot appear first or last in a (sub)expression or after another `|',
477i.e. an operand of `|' cannot be an empty subexpression.
478An empty parenthesized subexpression, `()', is legal and matches an
479empty (sub)string.
480An empty string is not a legal RE.
481.Pp
482A `{' followed by a digit is considered the beginning of bounds for a
483bounded repetition, which must then follow the syntax for bounds.
484A `{' \fInot\fR followed by a digit is considered an ordinary character.
485.Pp
486`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
487REs are anchors, not ordinary characters.
488.Sh SEE ALSO
489.Xr grep 1 ,
490.Xr sed 1 ,
491.Xr re_format 7
492.Pp
493.St -p1003.2-92 ,
494sections 2.8 (Regular Expression Notation)
495and
496B.5 (C Binding for Regular Expression Matching).
497.Sh DIAGNOSTICS
498Non-zero error codes from
499.Fn regcomp
500and
501.Fn regexec
502include the following:
503.Pp
504.Bl -tag -width XXXREG_ECOLLATE -compact
505.It Dv REG_NOMATCH
506regexec() failed to match
507.It Dv REG_BADPAT
508invalid regular expression
509.It Dv REG_ECOLLATE
510invalid collating element
511.It Dv REG_ECTYPE
512invalid character class
513.It Dv REG_EESCAPE
514\e applied to unescapable character
515.It Dv REG_ESUBREG
516invalid backreference number
517.It Dv REG_EBRACK
518brackets [ ] not balanced
519.It Dv REG_EPAREN
520parentheses ( ) not balanced
521.It Dv REG_EBRACE
522braces { } not balanced
523.It Dv REG_BADBR
524invalid repetition count(s) in { }
525.It Dv REG_ERANGE
526invalid character range in [ ]
527.It Dv REG_ESPACE
528ran out of memory
529.It Dv REG_BADRPT
530?, *, or + operand invalid
531.It Dv REG_EMPTY
532empty (sub)expression
533.It Dv REG_ASSERT
534``can't happen''\(emyou found a bug
535.It Dv REG_INVARG
536invalid argument, e.g. negative-length string
537.El
538.Sh HISTORY
539Originally written by Henry Spencer.
540Altered for inclusion in the 4.4BSD distribution.
541.Sh BUGS
542This is an alpha release with known defects.
543Please report problems.
544.Pp
545There is one known functionality bug.
546The implementation of internationalization is incomplete:
547the locale is always assumed to be the default one of
548.St -p1003.2-92 ,
549and only the collating elements etc. of that locale are available.
550.Pp
551The back-reference code is subtle and doubts linger about its correctness
552in complex cases.
553.Pp
554.Fn Regexec
555performance is poor.
556This will improve with later releases.
557.Fa Nmatch
558exceeding 0 is expensive;
559.Fa nmatch
560exceeding 1 is worse.
561.Fa Regexec
562is largely insensitive to RE complexity
563.Em except
564that back references are massively expensive.
565RE length does matter; in particular, there is a strong speed bonus
566for keeping RE length under about 30 characters,
567with most special characters counting roughly double.
568.Pp
569.Fn Regcomp
570implements bounded repetitions by macro expansion,
571which is costly in time and space if counts are large
572or bounded repetitions are nested.
573An RE like, say,
574`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
575will (eventually) run almost any existing machine out of swap space.
576.Pp
577There are suspected problems with response to obscure error conditions.
578Notably,
579certain kinds of internal overflow,
580produced only by truly enormous REs or by multiply nested bounded repetitions,
581are probably not handled well.
582.Pp
583Due to a mistake in
584.St -p1003.2-92 ,
585things like `a)b' are legal REs because `)' is a special character
586only in the presence of a previous unmatched `('.  This can't be fixed
587until the spec is fixed.
588.Pp
589The standard's definition of back references is vague.
590For example, does
591`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
592Until the standard is clarified, behavior in such cases should not be
593relied on.
594.Pp
595The implementation of word-boundary matching is a bit of a kludge,
596and bugs may lurk in combinations of word-boundary matching and anchoring.
597