xref: /openbsd-src/lib/libc/regex/regex.3 (revision ce7e0fc6a9d74d25b78fb6ad846387717f5172b6)
1.\"	$OpenBSD: regex.3,v 1.15 2000/08/09 15:56:45 aaron Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. All advertising materials mentioning features or use of this software
21.\"    must display the following acknowledgement:
22.\"	This product includes software developed by the University of
23.\"	California, Berkeley and its contributors.
24.\" 4. Neither the name of the University nor the names of its contributors
25.\"    may be used to endorse or promote products derived from this software
26.\"    without specific prior written permission.
27.\"
28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
31.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
38.\" SUCH DAMAGE.
39.\"
40.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
41.\"
42.Dd March 20, 1994
43.Dt REGEX 3
44.Os
45.Sh NAME
46.Nm regcomp ,
47.Nm regexec ,
48.Nm regerror ,
49.Nm regfree
50.Nd regular-expression library
51.Sh SYNOPSIS
52.Fd #include <sys/types.h>
53.Fd #include <regex.h>
54.Fo "int regcomp"
55.Fa "regex_t *preg"
56.Fa "const char *pattern"
57.Fa "int cflags"
58.Fc
59.Fo "int regexec"
60.Fa "const regex_t *preg"
61.Fa "const char *string"
62.Fa "size_t nmatch"
63.Fa "regmatch_t pmatch[]"
64.Fa "int eflags"
65.Fc
66.Fo "size_t regerror"
67.Fa "int errcode"
68.Fa "const regex_t *preg"
69.Fa "char *errbuf"
70.Fa "size_t errbuf_size"
71.Fc
72.Fo "void regfree"
73.Fa "regex_t *preg"
74.Fc
75.Sh DESCRIPTION
76These routines implement POSIX 1003.2 regular expressions
77.Pq Dq REs ;
78see
79.Xr re_format 7 .
80.Fn regcomp
81compiles an RE written as a string into an internal form,
82.Fn regexec
83matches that internal form against a string and reports results,
84.Fn regerror
85transforms error codes from either into human-readable messages, and
86.Fn regfree
87frees any dynamically allocated storage used by the internal form
88of an RE.
89.Pp
90The header
91.Aq Pa regex.h
92declares two structure types,
93.Li regex_t
94and
95.Li regmatch_t ,
96the former for compiled internal forms and the latter for match reporting.
97It also declares the four functions,
98a type
99.Li regoff_t ,
100and a number of constants with names starting with
101.Dv REG_ .
102.Pp
103.Fn regcomp
104compiles the regular expression contained in the
105.Fa pattern
106string,
107subject to the flags in
108.Fa cflags ,
109and places the results in the
110.Li regex_t
111structure pointed to by
112.Fa preg .
113.Fa cflags
114is the bitwise
115.Tn OR
116of zero or more of the following flags:
117.Bl -tag -width XREG_EXTENDEDX
118.It Dv REG_EXTENDED
119Compile modern
120.Pq Dq extended
121REs,
122rather than the obsolete
123.Pq Dq basic
124REs that are the default.
125.It Dv REG_BASIC
126This is a synonym for 0,
127provided as a counterpart to
128.Dv REG_EXTENDED
129to improve readability.
130.It Dv REG_NOSPEC
131Compile with recognition of all special characters turned off.
132All characters are thus considered ordinary,
133so the RE is a literal string.
134This is an extension,
135compatible with but not specified by POSIX 1003.2,
136and should be used with
137caution in software intended to be portable to other systems.
138.Dv REG_EXTENDED
139and
140.Dv REG_NOSPEC
141may not be used in the same call to
142.Fn regcomp .
143.It Dv REG_ICASE
144Compile for matching that ignores upper/lower case distinctions.
145See
146.Xr re_format 7 .
147.It Dv REG_NOSUB
148Compile for matching that need only report success or failure,
149not what was matched.
150.It Dv REG_NEWLINE
151Compile for newline-sensitive matching.
152By default, newline is a completely ordinary character with no special
153meaning in either REs or strings.
154With this flag,
155.Ql \&[^
156bracket expressions and
157.Ql \&.
158never match newline,
159a
160.Ql ^
161anchor matches the null string after any newline in the string
162in addition to its normal function,
163and the
164.Ql $
165anchor matches the null string before any newline in the
166string in addition to its normal function.
167.It Dv REG_PEND
168The regular expression ends,
169not at the first NUL,
170but just before the character pointed to by the
171.Fa re_endp
172member of the structure pointed to by
173.Fa preg .
174The
175.Fa re_endp
176member is of type
177.Fa const\ char\ * .
178This flag permits inclusion of NULs in the RE;
179they are considered ordinary characters.
180This is an extension,
181compatible with but not specified by POSIX 1003.2,
182and should be used with
183caution in software intended to be portable to other systems.
184.El
185.Pp
186When successful,
187.Fn regcomp
188returns 0 and fills in the structure pointed to by
189.Fa preg .
190One member of that structure
191(other than
192.Fa re_endp )
193is publicized:
194.Fa re_nsub ,
195of type
196.Fa size_t ,
197contains the number of parenthesized subexpressions within the RE
198(except that the value of this member is undefined if the
199.Dv REG_NOSUB
200flag was used).
201If
202.Fn regcomp
203fails, it returns a non-zero error code;
204see DIAGNOSTICS.
205.Pp
206.Fn regexec
207matches the compiled RE pointed to by
208.Fa preg
209against the
210.Fa string ,
211subject to the flags in
212.Fa eflags ,
213and reports results using
214.Fa nmatch ,
215.Fa pmatch ,
216and the returned value.
217The RE must have been compiled by a previous invocation of
218.Fn regcomp .
219The compiled form is not altered during execution of
220.Fn regexec ,
221so a single compiled RE can be used simultaneously by multiple threads.
222.Pp
223By default,
224the null-terminated string pointed to by
225.Fa string
226is considered to be the text of an entire line, minus any terminating
227newline.
228The
229.Fa eflags
230argument is the bitwise
231.Tn OR
232of zero or more of the following flags:
233.Bl -tag -width XREG_STARTENDX
234.It Dv REG_NOTBOL
235The first character of
236the string
237is not the beginning of a line, so the
238.Ql ^
239anchor should not match before it.
240This does not affect the behavior of newlines under
241.Dv REG_NEWLINE .
242.It Dv REG_NOTEOL
243The NUL terminating
244the string
245does not end a line, so the
246.Ql $
247anchor should not match before it.
248This does not affect the behavior of newlines under
249.Dv REG_NEWLINE .
250.It Dv REG_STARTEND
251The string is considered to start at
252\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
253and to have a terminating NUL located at
254\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
255(there need not actually be a NUL at that location),
256regardless of the value of
257.Fa nmatch .
258See below for the definition of
259.Fa pmatch
260and
261.Fa nmatch .
262This is an extension,
263compatible with but not specified by POSIX 1003.2,
264and should be used with
265caution in software intended to be portable to other systems.
266Note that a non-zero \fIrm_so\fR does not imply
267.Dv REG_NOTBOL ;
268.Dv REG_STARTEND
269affects only the location of the string,
270not how it is matched.
271.El
272.Pp
273See
274.Xr re_format 7
275for a discussion of what is matched in situations where an RE or a
276portion thereof could match any of several substrings of
277.Fa string .
278.Pp
279Normally,
280.Fn regexec
281returns 0 for success and the non-zero code
282.Dv REG_NOMATCH
283for failure.
284Other non-zero error codes may be returned in exceptional situations;
285see DIAGNOSTICS.
286.Pp
287If
288.Dv REG_NOSUB
289was specified in the compilation of the RE,
290or if
291.Fa nmatch
292is 0,
293.Fn regexec
294ignores the
295.Fa pmatch
296argument (but see below for the case where
297.Dv REG_STARTEND
298is specified).
299Otherwise,
300.Fa pmatch
301points to an array of
302.Fa nmatch
303structures of type
304.Li regmatch_t .
305Such a structure has at least the members
306.Fa rm_so
307and
308.Fa rm_eo ,
309both of type
310.Fa regoff_t
311(a signed arithmetic type at least as large as an
312.Li off_t
313and a
314.Li ssize_t ) ,
315containing respectively the offset of the first character of a substring
316and the offset of the first character after the end of the substring.
317Offsets are measured from the beginning of the
318.Fa string
319argument given to
320.Fn regexec .
321An empty substring is denoted by equal offsets,
322both indicating the character following the empty substring.
323.Pp
324The 0th member of the
325.Fa pmatch
326array is filled in to indicate what substring of
327.Fa string
328was matched by the entire RE.
329Remaining members report what substring was matched by parenthesized
330subexpressions within the RE;
331member
332.Va i
333reports subexpression
334.Va i ,
335with subexpressions counted (starting at 1) by the order of their opening
336parentheses in the RE, left to right.
337Unused entries in the array\(emcorresponding either to subexpressions that
338did not participate in the match at all, or to subexpressions that do not
339exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
340.Fa rm_so
341and
342.Fa rm_eo
343set to \-1.
344If a subexpression participated in the match several times,
345the reported substring is the last one it matched.
346(Note, as an example in particular, that when the RE
347.Dq (b*)+
348matches
349.Dq bbb ,
350the parenthesized subexpression matches each of the three
351.Sq b Ns s
352and then
353an infinite number of empty strings following the last
354.Sq b ,
355so the reported substring is one of the empties.)
356.Pp
357If
358.Dv REG_STARTEND
359is specified,
360.Fa pmatch
361must point to at least one
362.Li regmatch_t
363(even if
364.Fa nmatch
365is 0 or
366.Dv REG_NOSUB
367was specified),
368to hold the input offsets for
369.Dv REG_STARTEND .
370Use for output is still entirely controlled by
371.Fa nmatch ;
372if
373.Fa nmatch
374is 0 or
375.Dv REG_NOSUB
376was specified,
377the value of
378.Fa pmatch[0]
379will not be changed by a successful
380.Fn regexec .
381.Pp
382.Fn regerror
383maps a non-zero
384.Va errcode
385from either
386.Fn regcomp
387or
388.Fn regexec
389to a human-readable, printable message.
390If
391.Fa preg
392is non-NULL,
393the error code should have arisen from use of
394the
395.Li regex_t
396pointed to by
397.Fa preg ,
398and if the error code came from
399.Fn regcomp ,
400it should have been the result from the most recent
401.Fn regcomp
402using that
403.Li regex_t .
404.Pf ( Fn regerror
405may be able to supply a more detailed message using information
406from the
407.Li regex_t . )
408.Fn regerror
409places the null-terminated message into the buffer pointed to by
410.Fa errbuf ,
411limiting the length (including the NUL) to at most
412.Fa errbuf_size
413bytes.
414If the whole message won't fit,
415as much of it as will fit before the terminating NUL is supplied.
416In any case,
417the returned value is the size of buffer needed to hold the whole
418message (including the terminating NUL).
419If
420.Fa errbuf_size
421is 0,
422.Fa errbuf
423is ignored but the return value is still correct.
424.Pp
425If the
426.Fa errcode
427given to
428.Fn regerror
429is first
430.Tn OR Ns 'ed
431with
432.Dv REG_ITOA ,
433the
434.Dq message
435that results is the printable name of the error code,
436e.g.,
437.Dq REG_NOMATCH ,
438rather than an explanation thereof.
439If
440.Fa errcode
441is
442.Dv REG_ATOI ,
443then
444.Fa preg
445shall be non-null and the
446.Fa re_endp
447member of the structure it points to
448must point to the printable name of an error code;
449in this case, the result in
450.Fa errbuf
451is the decimal digits of
452the numeric value of the error code
453(0 if the name is not recognized).
454.Dv REG_ITOA
455and
456.Dv REG_ATOI
457are intended primarily as debugging facilities;
458they are extensions,
459compatible with but not specified by POSIX 1003.2,
460and should be used with
461caution in software intended to be portable to other systems.
462Be warned also that they are considered experimental and changes are possible.
463.Pp
464.Fn regfree
465frees any dynamically allocated storage associated with the compiled RE
466pointed to by
467.Fa preg .
468The remaining
469.Li regex_t
470is no longer a valid compiled RE
471and the effect of supplying it to
472.Fn regexec
473or
474.Fn regerror
475is undefined.
476.Pp
477None of these functions references global variables except for tables
478of constants;
479all are safe for use from multiple threads if the arguments are safe.
480.Sh IMPLEMENTATION CHOICES
481There are a number of decisions that 1003.2 leaves up to the implementor,
482either by explicitly saying
483.Dq undefined
484or by virtue of them being
485forbidden by the RE grammar.
486This implementation treats them as follows.
487.Pp
488See
489.Xr re_format 7
490for a discussion of the definition of case-independent matching.
491.Pp
492There is no particular limit on the length of REs,
493except insofar as memory is limited.
494Memory usage is approximately linear in RE size, and largely insensitive
495to RE complexity, except for bounded repetitions.
496See
497.Sx BUGS
498for one short RE using them
499that will run almost any system out of memory.
500.Pp
501A backslashed character other than one specifically given a magic meaning
502by 1003.2 (such magic meanings occur only in obsolete REs)
503is taken as an ordinary character.
504.Pp
505Any unmatched
506.Ql \&[
507is a
508.Dv REG_EBRACK
509error.
510.Pp
511Equivalence classes cannot begin or end bracket-expression ranges.
512The endpoint of one range cannot begin another.
513.Pp
514RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
515.Pp
516A repetition operator (?, *, +, or bounds) cannot follow another
517repetition operator.
518A repetition operator cannot begin an expression or subexpression
519or follow
520.Ql ^
521or
522.Ql | .
523.Pp
524A
525.Ql |
526cannot appear first or last in a (sub)expression, or after another
527.Ql | ,
528i.e., an operand of
529.Ql |
530cannot be an empty subexpression.
531An empty parenthesized subexpression,
532.Ql \&(\&) ,
533is legal and matches an
534empty (sub)string.
535An empty string is not a legal RE.
536.Pp
537A
538.Ql {
539followed by a digit is considered the beginning of bounds for a
540bounded repetition, which must then follow the syntax for bounds.
541A
542.Ql {
543.Em not
544followed by a digit is considered an ordinary character.
545.Pp
546.Ql ^
547and
548.Ql $
549beginning and ending subexpressions in obsolete
550.Pq Dq basic
551REs are anchors, not ordinary characters.
552.Sh SEE ALSO
553.Xr grep 1 ,
554.Xr re_format 7
555.Pp
556POSIX 1003.2, sections 2.8 (Regular Expression Notation)
557and
558B.5 (C Binding for Regular Expression Matching).
559.Sh DIAGNOSTICS
560Non-zero error codes from
561.Fn regcomp
562and
563.Fn regexec
564include the following:
565.Pp
566.Bl -tag -compact -width XREG_ECOLLATEX
567.It Er REG_NOMATCH
568regexec() failed to match
569.It Er REG_BADPAT
570invalid regular expression
571.It Er REG_ECOLLATE
572invalid collating element
573.It Er REG_ECTYPE
574invalid character class
575.It Er REG_EESCAPE
576\e applied to unescapable character
577.It Er REG_ESUBREG
578invalid backreference number
579.It Er REG_EBRACK
580brackets [ ] not balanced
581.It Er REG_EPAREN
582parentheses ( ) not balanced
583.It Er REG_EBRACE
584braces { } not balanced
585.It Er REG_BADBR
586invalid repetition count(s) in { }
587.It Er REG_ERANGE
588invalid character range in [ ]
589.It Er REG_ESPACE
590ran out of memory
591.It Er REG_BADRPT
592?, *, or + operand invalid
593.It Er REG_EMPTY
594empty (sub)expression
595.It Er REG_ASSERT
596.Dq can't happen
597\(emyou found a bug
598.It Er REG_INVARG
599invalid argument, e.g., negative-length string
600.El
601.Sh HISTORY
602Originally written by Henry Spencer.
603Altered for inclusion in the
604.Bx 4.4
605distribution.
606.Sh BUGS
607This is an alpha release with known defects.
608Please report problems.
609.Pp
610There is one known functionality bug.
611The implementation of internationalization is incomplete:
612the locale is always assumed to be the default one of 1003.2,
613and only the collating elements etc. of that locale are available.
614.Pp
615The back-reference code is subtle and doubts linger about its correctness
616in complex cases.
617.Pp
618.Fn regexec
619performance is poor.
620This will improve with later releases.
621.Fa nmatch
622exceeding 0 is expensive;
623.Fa nmatch
624exceeding 1 is worse.
625.Fn regexec
626is largely insensitive to RE complexity
627.Em except
628that back references are massively expensive.
629RE length does matter; in particular, there is a strong speed bonus
630for keeping RE length under about 30 characters,
631with most special characters counting roughly double.
632.Pp
633.Fn regcomp
634implements bounded repetitions by macro expansion,
635which is costly in time and space if counts are large
636or bounded repetitions are nested.
637A RE like, say,
638.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
639will (eventually) run almost any existing machine out of swap space.
640.Pp
641There are suspected problems with response to obscure error conditions.
642Notably,
643certain kinds of internal overflow,
644produced only by truly enormous REs or by multiply nested bounded repetitions,
645are probably not handled well.
646.Pp
647Due to a mistake in 1003.2, things like
648.Ql a)b
649are legal REs because
650.Ql \&)
651is
652a special character only in the presence of a previous unmatched
653.Ql \&( .
654This can't be fixed until the spec is fixed.
655.Pp
656The standard's definition of back references is vague.
657For example, does
658.Dq a\e(\e(b\e)*\e2\e)*d
659match
660.Dq abbbd ?
661Until the standard is clarified,
662behavior in such cases should not be relied on.
663.Pp
664The implementation of word-boundary matching is a bit of a kludge,
665and bugs may lurk in combinations of word-boundary matching and anchoring.
666