xref: /openbsd-src/lib/libc/regex/regex.3 (revision 1fc27e414118cd8922c6b93fbaeb7a5246bfd593)
1.\"	$OpenBSD: regex.3,v 1.11 1999/07/09 13:35:22 aaron Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. All advertising materials mentioning features or use of this software
21.\"    must display the following acknowledgement:
22.\"	This product includes software developed by the University of
23.\"	California, Berkeley and its contributors.
24.\" 4. Neither the name of the University nor the names of its contributors
25.\"    may be used to endorse or promote products derived from this software
26.\"    without specific prior written permission.
27.\"
28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
31.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
38.\" SUCH DAMAGE.
39.\"
40.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
41.\"
42.Dd March 20, 1994
43.Dt REGEX 3
44.Os
45.Sh NAME
46.Nm regcomp, regexec, regerror, regfree
47.Nd regular-expression library
48.Sh SYNOPSIS
49.Fd #include <sys/types.h>
50.Fd #include <regex.h>
51.Fo "int regcomp"
52.Fa "regex_t *preg"
53.Fa "const char *pattern"
54.Fa "int cflags"
55.Fc
56.Fo "int regexec"
57.Fa "const regex_t *preg"
58.Fa "const char *string"
59.Fa "size_t nmatch"
60.Fa "regmatch_t pmatch[]"
61.Fa "int eflags"
62.Fc
63.Fo "size_t regerror"
64.Fa "int errcode"
65.Fa "const regex_t *preg"
66.Fa "char *errbuf"
67.Fa "size_t errbuf_size"
68.Fc
69.Fo "void regfree"
70.Fa "regex_t *preg"
71.Fc
72.Sh DESCRIPTION
73These routines implement POSIX 1003.2 regular expressions
74.Pq Dq REs ;
75see
76.Xr re_format 7 .
77.Fn regcomp
78compiles an RE written as a string into an internal form,
79.Fn regexec
80matches that internal form against a string and reports results,
81.Fn regerror
82transforms error codes from either into human-readable messages, and
83.Fn regfree
84frees any dynamically allocated storage used by the internal form
85of an RE.
86.Pp
87The header
88.Aq Pa regex.h
89declares two structure types,
90.Li regex_t
91and
92.Li regmatch_t ,
93the former for compiled internal forms and the latter for match reporting.
94It also declares the four functions,
95a type
96.Li regoff_t ,
97and a number of constants with names starting with
98.Dv REG_ .
99.Pp
100.Fn regcomp
101compiles the regular expression contained in the
102.Fa pattern
103string,
104subject to the flags in
105.Fa cflags ,
106and places the results in the
107.Li regex_t
108structure pointed to by
109.Fa preg .
110.Fa cflags
111is the bitwise
112.Tn OR
113of zero or more of the following flags:
114.Bl -tag -width XREG_EXTENDEDX
115.It Dv REG_EXTENDED
116Compile modern
117.Pq Dq extended
118REs,
119rather than the obsolete
120.Pq Dq basic
121REs that are the default.
122.It Dv REG_BASIC
123This is a synonym for 0,
124provided as a counterpart to
125.Dv REG_EXTENDED
126to improve readability.
127.It Dv REG_NOSPEC
128Compile with recognition of all special characters turned off.
129All characters are thus considered ordinary,
130so the RE is a literal string.
131This is an extension,
132compatible with but not specified by POSIX 1003.2,
133and should be used with
134caution in software intended to be portable to other systems.
135.Dv REG_EXTENDED
136and
137.Dv REG_NOSPEC
138may not be used in the same call to
139.Fn regcomp .
140.It Dv REG_ICASE
141Compile for matching that ignores upper/lower case distinctions.
142See
143.Xr re_format 7 .
144.It Dv REG_NOSUB
145Compile for matching that need only report success or failure,
146not what was matched.
147.It Dv REG_NEWLINE
148Compile for newline-sensitive matching.
149By default, newline is a completely ordinary character with no special
150meaning in either REs or strings.
151With this flag,
152.Ql \&[^
153bracket expressions and
154.Ql \&.
155never match newline,
156a
157.Ql ^
158anchor matches the null string after any newline in the string
159in addition to its normal function,
160and the
161.Ql $
162anchor matches the null string before any newline in the
163string in addition to its normal function.
164.It Dv REG_PEND
165The regular expression ends,
166not at the first NUL,
167but just before the character pointed to by the
168.Fa re_endp
169member of the structure pointed to by
170.Fa preg .
171The
172.Fa re_endp
173member is of type
174.Fa const\ char\ * .
175This flag permits inclusion of NULs in the RE;
176they are considered ordinary characters.
177This is an extension,
178compatible with but not specified by POSIX 1003.2,
179and should be used with
180caution in software intended to be portable to other systems.
181.El
182.Pp
183When successful,
184.Fn regcomp
185returns 0 and fills in the structure pointed to by
186.Fa preg .
187One member of that structure
188(other than
189.Fa re_endp )
190is publicized:
191.Fa re_nsub ,
192of type
193.Fa size_t ,
194contains the number of parenthesized subexpressions within the RE
195(except that the value of this member is undefined if the
196.Dv REG_NOSUB
197flag was used).
198If
199.Fn regcomp
200fails, it returns a non-zero error code;
201see DIAGNOSTICS.
202.Pp
203.Fn regexec
204matches the compiled RE pointed to by
205.Fa preg
206against the
207.Fa string ,
208subject to the flags in
209.Fa eflags ,
210and reports results using
211.Fa nmatch ,
212.Fa pmatch ,
213and the returned value.
214The RE must have been compiled by a previous invocation of
215.Fn regcomp .
216The compiled form is not altered during execution of
217.Fn regexec ,
218so a single compiled RE can be used simultaneously by multiple threads.
219.Pp
220By default,
221the null-terminated string pointed to by
222.Fa string
223is considered to be the text of an entire line, minus any terminating
224newline.
225The
226.Fa eflags
227argument is the bitwise
228.Tn OR
229of zero or more of the following flags:
230.Bl -tag -width XREG_STARTENDX
231.It Dv REG_NOTBOL
232The first character of
233the string
234is not the beginning of a line, so the
235.Ql ^
236anchor should not match before it.
237This does not affect the behavior of newlines under
238.Dv REG_NEWLINE .
239.It Dv REG_NOTEOL
240The NUL terminating
241the string
242does not end a line, so the
243.Ql $
244anchor should not match before it.
245This does not affect the behavior of newlines under
246.Dv REG_NEWLINE .
247.It Dv REG_STARTEND
248The string is considered to start at
249\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
250and to have a terminating NUL located at
251\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
252(there need not actually be a NUL at that location),
253regardless of the value of
254.Fa nmatch .
255See below for the definition of
256.Fa pmatch
257and
258.Fa nmatch .
259This is an extension,
260compatible with but not specified by POSIX 1003.2,
261and should be used with
262caution in software intended to be portable to other systems.
263Note that a non-zero \fIrm_so\fR does not imply
264.Dv REG_NOTBOL ;
265.Dv REG_STARTEND
266affects only the location of the string,
267not how it is matched.
268.El
269.Pp
270See
271.Xr re_format 7
272for a discussion of what is matched in situations where an RE or a
273portion thereof could match any of several substrings of
274.Fa string .
275.Pp
276Normally,
277.Fn regexec
278returns 0 for success and the non-zero code
279.Dv REG_NOMATCH
280for failure.
281Other non-zero error codes may be returned in exceptional situations;
282see DIAGNOSTICS.
283.Pp
284If
285.Dv REG_NOSUB
286was specified in the compilation of the RE,
287or if
288.Fa nmatch
289is 0,
290.Fn regexec
291ignores the
292.Fa pmatch
293argument (but see below for the case where
294.Dv REG_STARTEND
295is specified).
296Otherwise,
297.Fa pmatch
298points to an array of
299.Fa nmatch
300structures of type
301.Li regmatch_t .
302Such a structure has at least the members
303.Fa rm_so
304and
305.Fa rm_eo ,
306both of type
307.Fa regoff_t
308(a signed arithmetic type at least as large as an
309.Li off_t
310and a
311.Li ssize_t ),
312containing respectively the offset of the first character of a substring
313and the offset of the first character after the end of the substring.
314Offsets are measured from the beginning of the
315.Fa string
316argument given to
317.Fn regexec .
318An empty substring is denoted by equal offsets,
319both indicating the character following the empty substring.
320.Pp
321The 0th member of the
322.Fa pmatch
323array is filled in to indicate what substring of
324.Fa string
325was matched by the entire RE.
326Remaining members report what substring was matched by parenthesized
327subexpressions within the RE;
328member
329.Va i
330reports subexpression
331.Va i ,
332with subexpressions counted (starting at 1) by the order of their opening
333parentheses in the RE, left to right.
334Unused entries in the array\(emcorresponding either to subexpressions that
335did not participate in the match at all, or to subexpressions that do not
336exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
337.Fa rm_so
338and
339.Fa rm_eo
340set to \-1.
341If a subexpression participated in the match several times,
342the reported substring is the last one it matched.
343(Note, as an example in particular, that when the RE
344.Dq (b*)+
345matches
346.Dq bbb ,
347the parenthesized subexpression matches each of the three
348.Sq b Ns s
349and then
350an infinite number of empty strings following the last
351.Sq b ,
352so the reported substring is one of the empties.)
353.Pp
354If
355.Dv REG_STARTEND
356is specified,
357.Fa pmatch
358must point to at least one
359.Li regmatch_t
360(even if
361.Fa nmatch
362is 0 or
363.Dv REG_NOSUB
364was specified),
365to hold the input offsets for
366.Dv REG_STARTEND .
367Use for output is still entirely controlled by
368.Fa nmatch ;
369if
370.Fa nmatch
371is 0 or
372.Dv REG_NOSUB
373was specified,
374the value of
375.Fa pmatch[0]
376will not be changed by a successful
377.Fn regexec .
378.Pp
379.Fn regerror
380maps a non-zero
381.Va errcode
382from either
383.Fn regcomp
384or
385.Fn regexec
386to a human-readable, printable message.
387If
388.Fa preg
389is non-NULL,
390the error code should have arisen from use of
391the
392.Li regex_t
393pointed to by
394.Fa preg ,
395and if the error code came from
396.Fn regcomp ,
397it should have been the result from the most recent
398.Fn regcomp
399using that
400.Li regex_t . (
401.Fn regerror
402may be able to supply a more detailed message using information
403from the
404.Li regex_t . )
405.Fn regerror
406places the null-terminated message into the buffer pointed to by
407.Fa errbuf ,
408limiting the length (including the NUL) to at most
409.Fa errbuf_size
410bytes.
411If the whole message won't fit,
412as much of it as will fit before the terminating NUL is supplied.
413In any case,
414the returned value is the size of buffer needed to hold the whole
415message (including the terminating NUL).
416If
417.Fa errbuf_size
418is 0,
419.Fa errbuf
420is ignored but the return value is still correct.
421.Pp
422If the
423.Fa errcode
424given to
425.Fn regerror
426is first
427.Tn OR Ns 'ed
428with
429.Dv REG_ITOA ,
430the
431.Dq message
432that results is the printable name of the error code,
433e.g.,
434.Dq REG_NOMATCH ,
435rather than an explanation thereof.
436If
437.Fa errcode
438is
439.Dv REG_ATOI ,
440then
441.Fa preg
442shall be non-null and the
443.Fa re_endp
444member of the structure it points to
445must point to the printable name of an error code;
446in this case, the result in
447.Fa errbuf
448is the decimal digits of
449the numeric value of the error code
450(0 if the name is not recognized).
451.Dv REG_ITOA
452and
453.Dv REG_ATOI
454are intended primarily as debugging facilities;
455they are extensions,
456compatible with but not specified by POSIX 1003.2,
457and should be used with
458caution in software intended to be portable to other systems.
459Be warned also that they are considered experimental and changes are possible.
460.Pp
461.Fn regfree
462frees any dynamically allocated storage associated with the compiled RE
463pointed to by
464.Fa preg .
465The remaining
466.Li regex_t
467is no longer a valid compiled RE
468and the effect of supplying it to
469.Fn regexec
470or
471.Fn regerror
472is undefined.
473.Pp
474None of these functions references global variables except for tables
475of constants;
476all are safe for use from multiple threads if the arguments are safe.
477.Sh IMPLEMENTATION CHOICES
478There are a number of decisions that 1003.2 leaves up to the implementor,
479either by explicitly saying
480.Dq undefined
481or by virtue of them being
482forbidden by the RE grammar.
483This implementation treats them as follows.
484.Pp
485See
486.Xr re_format 7
487for a discussion of the definition of case-independent matching.
488.Pp
489There is no particular limit on the length of REs,
490except insofar as memory is limited.
491Memory usage is approximately linear in RE size, and largely insensitive
492to RE complexity, except for bounded repetitions.
493See
494.Sx BUGS
495for one short RE using them
496that will run almost any system out of memory.
497.Pp
498A backslashed character other than one specifically given a magic meaning
499by 1003.2 (such magic meanings occur only in obsolete REs)
500is taken as an ordinary character.
501.Pp
502Any unmatched
503.Ql \&[
504is a
505.Dv REG_EBRACK
506error.
507.Pp
508Equivalence classes cannot begin or end bracket-expression ranges.
509The endpoint of one range cannot begin another.
510.Pp
511RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
512.Pp
513A repetition operator (?, *, +, or bounds) cannot follow another
514repetition operator.
515A repetition operator cannot begin an expression or subexpression
516or follow
517.Ql ^
518or
519.Ql | .
520.Pp
521A
522.Ql |
523cannot appear first or last in a (sub)expression, or after another
524.Ql | ,
525i.e., an operand of
526.Ql |
527cannot be an empty subexpression.
528An empty parenthesized subexpression,
529.Ql \&(\&) ,
530is legal and matches an
531empty (sub)string.
532An empty string is not a legal RE.
533.Pp
534A
535.Ql {
536followed by a digit is considered the beginning of bounds for a
537bounded repetition, which must then follow the syntax for bounds.
538A
539.Ql {
540.Em not
541followed by a digit is considered an ordinary character.
542.Pp
543.Ql ^
544and
545.Ql $
546beginning and ending subexpressions in obsolete
547.Pq Dq basic
548REs are anchors, not ordinary characters.
549.Sh SEE ALSO
550.Xr grep 1 ,
551.Xr re_format 7
552.Pp
553POSIX 1003.2, sections 2.8 (Regular Expression Notation)
554and
555B.5 (C Binding for Regular Expression Matching).
556.Sh DIAGNOSTICS
557Non-zero error codes from
558.Fn regcomp
559and
560.Fn regexec
561include the following:
562.Pp
563.Bl -tag -compact -width XREG_ECOLLATEX
564.It Er REG_NOMATCH
565regexec() failed to match
566.It Er REG_BADPAT
567invalid regular expression
568.It Er REG_ECOLLATE
569invalid collating element
570.It Er REG_ECTYPE
571invalid character class
572.It Er REG_EESCAPE
573\e applied to unescapable character
574.It Er REG_ESUBREG
575invalid backreference number
576.It Er REG_EBRACK
577brackets [ ] not balanced
578.It Er REG_EPAREN
579parentheses ( ) not balanced
580.It Er REG_EBRACE
581braces { } not balanced
582.It Er REG_BADBR
583invalid repetition count(s) in { }
584.It Er REG_ERANGE
585invalid character range in [ ]
586.It Er REG_ESPACE
587ran out of memory
588.It Er REG_BADRPT
589?, *, or + operand invalid
590.It Er REG_EMPTY
591empty (sub)expression
592.It Er REG_ASSERT
593.Dq can't happen
594\(emyou found a bug
595.It Er REG_INVARG
596invalid argument, e.g. negative-length string
597.El
598.Sh HISTORY
599Originally written by Henry Spencer.
600Altered for inclusion in the
601.Bx 4.4
602distribution.
603.Sh BUGS
604This is an alpha release with known defects.
605Please report problems.
606.Pp
607There is one known functionality bug.
608The implementation of internationalization is incomplete:
609the locale is always assumed to be the default one of 1003.2,
610and only the collating elements etc. of that locale are available.
611.Pp
612The back-reference code is subtle and doubts linger about its correctness
613in complex cases.
614.Pp
615.Fn regexec
616performance is poor.
617This will improve with later releases.
618.Fa nmatch
619exceeding 0 is expensive;
620.Fa nmatch
621exceeding 1 is worse.
622.Fn regexec
623is largely insensitive to RE complexity
624.Em except
625that back references are massively expensive.
626RE length does matter; in particular, there is a strong speed bonus
627for keeping RE length under about 30 characters,
628with most special characters counting roughly double.
629.Pp
630.Fn regcomp
631implements bounded repetitions by macro expansion,
632which is costly in time and space if counts are large
633or bounded repetitions are nested.
634A RE like, say,
635.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
636will (eventually) run almost any existing machine out of swap space.
637.Pp
638There are suspected problems with response to obscure error conditions.
639Notably,
640certain kinds of internal overflow,
641produced only by truly enormous REs or by multiply nested bounded repetitions,
642are probably not handled well.
643.Pp
644Due to a mistake in 1003.2, things like
645.Ql a)b
646are legal REs because
647.Ql \&)
648is
649a special character only in the presence of a previous unmatched
650.Ql \&( .
651This can't be fixed until the spec is fixed.
652.Pp
653The standard's definition of back references is vague.
654For example, does
655.Dq a\e(\e(b\e)*\e2\e)*d
656match
657.Dq abbbd ?
658Until the standard is clarified,
659behavior in such cases should not be relied on.
660.Pp
661The implementation of word-boundary matching is a bit of a kludge,
662and bugs may lurk in combinations of word-boundary matching and anchoring.
663