xref: /openbsd-src/lib/libc/regex/regex.3 (revision a28daedfc357b214be5c701aa8ba8adb29a7f1c2)
1.\"	$OpenBSD: regex.3,v 1.21 2007/05/31 19:19:30 jmc Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
37.\"
38.Dd $Mdocdate: May 31 2007 $
39.Dt REGEX 3
40.Os
41.Sh NAME
42.Nm regcomp ,
43.Nm regexec ,
44.Nm regerror ,
45.Nm regfree
46.Nd regular expression routines
47.Sh SYNOPSIS
48.Fd #include <sys/types.h>
49.Fd #include <regex.h>
50.Ft int
51.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
52.Pp
53.Ft int
54.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \
55            "regmatch_t pmatch[]" "int eflags"
56.Pp
57.Ft size_t
58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \
59             "size_t errbuf_size"
60.Pp
61.Ft void
62.Fn regfree "regex_t *preg"
63.Sh DESCRIPTION
64These routines implement
65.St -p1003.2
66regular expressions
67.Pq Dq REs ;
68see
69.Xr re_format 7 .
70.Fn regcomp
71compiles an RE written as a string into an internal form,
72.Fn regexec
73matches that internal form against a string and reports results,
74.Fn regerror
75transforms error codes from either into human-readable messages, and
76.Fn regfree
77frees any dynamically allocated storage used by the internal form
78of an RE.
79.Pp
80The header
81.Aq Pa regex.h
82declares two structure types,
83.Li regex_t
84and
85.Li regmatch_t ,
86the former for compiled internal forms and the latter for match reporting.
87It also declares the four functions,
88a type
89.Li regoff_t ,
90and a number of constants with names starting with
91.Dv REG_ .
92.Pp
93.Fn regcomp
94compiles the regular expression contained in the
95.Fa pattern
96string,
97subject to the flags in
98.Fa cflags ,
99and places the results in the
100.Li regex_t
101structure pointed to by
102.Fa preg .
103.Fa cflags
104is the bitwise
105.Tn OR
106of zero or more of the following flags:
107.Bl -tag -width XREG_EXTENDEDX
108.It Dv REG_EXTENDED
109Compile modern
110.Pq Dq extended
111REs,
112rather than the obsolete
113.Pq Dq basic
114REs that are the default.
115.It Dv REG_BASIC
116This is a synonym for 0,
117provided as a counterpart to
118.Dv REG_EXTENDED
119to improve readability.
120.It Dv REG_NOSPEC
121Compile with recognition of all special characters turned off.
122All characters are thus considered ordinary,
123so the RE is a literal string.
124This is an extension,
125compatible with but not specified by
126.St -p1003.2 ,
127and should be used with
128caution in software intended to be portable to other systems.
129.Dv REG_EXTENDED
130and
131.Dv REG_NOSPEC
132may not be used in the same call to
133.Fn regcomp .
134.It Dv REG_ICASE
135Compile for matching that ignores upper/lower case distinctions.
136See
137.Xr re_format 7 .
138.It Dv REG_NOSUB
139Compile for matching that need only report success or failure,
140not what was matched.
141.It Dv REG_NEWLINE
142Compile for newline-sensitive matching.
143By default, newline is a completely ordinary character with no special
144meaning in either REs or strings.
145With this flag,
146.Ql \&[^
147bracket expressions and
148.Ql \&.
149never match newline,
150a
151.Ql ^
152anchor matches the null string after any newline in the string
153in addition to its normal function,
154and the
155.Ql $
156anchor matches the null string before any newline in the
157string in addition to its normal function.
158.It Dv REG_PEND
159The regular expression ends,
160not at the first NUL,
161but just before the character pointed to by the
162.Fa re_endp
163member of the structure pointed to by
164.Fa preg .
165The
166.Fa re_endp
167member is of type
168.Fa const\ char\ * .
169This flag permits inclusion of NULs in the RE;
170they are considered ordinary characters.
171This is an extension,
172compatible with but not specified by
173.St -p1003.2 ,
174and should be used with
175caution in software intended to be portable to other systems.
176.El
177.Pp
178When successful,
179.Fn regcomp
180returns 0 and fills in the structure pointed to by
181.Fa preg .
182One member of that structure
183(other than
184.Fa re_endp )
185is publicized:
186.Fa re_nsub ,
187of type
188.Fa size_t ,
189contains the number of parenthesized subexpressions within the RE
190(except that the value of this member is undefined if the
191.Dv REG_NOSUB
192flag was used).
193If
194.Fn regcomp
195fails, it returns a non-zero error code;
196see DIAGNOSTICS.
197.Pp
198.Fn regexec
199matches the compiled RE pointed to by
200.Fa preg
201against the
202.Fa string ,
203subject to the flags in
204.Fa eflags ,
205and reports results using
206.Fa nmatch ,
207.Fa pmatch ,
208and the returned value.
209The RE must have been compiled by a previous invocation of
210.Fn regcomp .
211The compiled form is not altered during execution of
212.Fn regexec ,
213so a single compiled RE can be used simultaneously by multiple threads.
214.Pp
215By default,
216the NUL-terminated string pointed to by
217.Fa string
218is considered to be the text of an entire line, minus any terminating
219newline.
220The
221.Fa eflags
222argument is the bitwise
223.Tn OR
224of zero or more of the following flags:
225.Bl -tag -width XREG_STARTENDX
226.It Dv REG_NOTBOL
227The first character of
228the string
229is not the beginning of a line, so the
230.Ql ^
231anchor should not match before it.
232This does not affect the behavior of newlines under
233.Dv REG_NEWLINE .
234.It Dv REG_NOTEOL
235The NUL terminating
236the string
237does not end a line, so the
238.Ql $
239anchor should not match before it.
240This does not affect the behavior of newlines under
241.Dv REG_NEWLINE .
242.It Dv REG_STARTEND
243The string is considered to start at
244\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
245and to have a terminating NUL located at
246\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
247(there need not actually be a NUL at that location),
248regardless of the value of
249.Fa nmatch .
250See below for the definition of
251.Fa pmatch
252and
253.Fa nmatch .
254This is an extension,
255compatible with but not specified by
256.St -p1003.2 ,
257and should be used with
258caution in software intended to be portable to other systems.
259Note that a non-zero \fIrm_so\fR does not imply
260.Dv REG_NOTBOL ;
261.Dv REG_STARTEND
262affects only the location of the string,
263not how it is matched.
264.El
265.Pp
266See
267.Xr re_format 7
268for a discussion of what is matched in situations where an RE or a
269portion thereof could match any of several substrings of
270.Fa string .
271.Pp
272Normally,
273.Fn regexec
274returns 0 for success and the non-zero code
275.Dv REG_NOMATCH
276for failure.
277Other non-zero error codes may be returned in exceptional situations;
278see DIAGNOSTICS.
279.Pp
280If
281.Dv REG_NOSUB
282was specified in the compilation of the RE,
283or if
284.Fa nmatch
285is 0,
286.Fn regexec
287ignores the
288.Fa pmatch
289argument (but see below for the case where
290.Dv REG_STARTEND
291is specified).
292Otherwise,
293.Fa pmatch
294points to an array of
295.Fa nmatch
296structures of type
297.Li regmatch_t .
298Such a structure has at least the members
299.Fa rm_so
300and
301.Fa rm_eo ,
302both of type
303.Fa regoff_t
304(a signed arithmetic type at least as large as an
305.Li off_t
306and a
307.Li ssize_t ) ,
308containing respectively the offset of the first character of a substring
309and the offset of the first character after the end of the substring.
310Offsets are measured from the beginning of the
311.Fa string
312argument given to
313.Fn regexec .
314An empty substring is denoted by equal offsets,
315both indicating the character following the empty substring.
316.Pp
317The 0th member of the
318.Fa pmatch
319array is filled in to indicate what substring of
320.Fa string
321was matched by the entire RE.
322Remaining members report what substring was matched by parenthesized
323subexpressions within the RE;
324member
325.Va i
326reports subexpression
327.Va i ,
328with subexpressions counted (starting at 1) by the order of their opening
329parentheses in the RE, left to right.
330Unused entries in the array\(emcorresponding either to subexpressions that
331did not participate in the match at all, or to subexpressions that do not
332exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
333.Fa rm_so
334and
335.Fa rm_eo
336set to \-1.
337If a subexpression participated in the match several times,
338the reported substring is the last one it matched.
339(Note, as an example in particular, that when the RE
340.Dq (b*)+
341matches
342.Dq bbb ,
343the parenthesized subexpression matches each of the three
344.Sq b Ns s
345and then
346an infinite number of empty strings following the last
347.Sq b ,
348so the reported substring is one of the empties.)
349.Pp
350If
351.Dv REG_STARTEND
352is specified,
353.Fa pmatch
354must point to at least one
355.Li regmatch_t
356(even if
357.Fa nmatch
358is 0 or
359.Dv REG_NOSUB
360was specified),
361to hold the input offsets for
362.Dv REG_STARTEND .
363Use for output is still entirely controlled by
364.Fa nmatch ;
365if
366.Fa nmatch
367is 0 or
368.Dv REG_NOSUB
369was specified,
370the value of
371.Fa pmatch[0]
372will not be changed by a successful
373.Fn regexec .
374.Pp
375.Fn regerror
376maps a non-zero
377.Va errcode
378from either
379.Fn regcomp
380or
381.Fn regexec
382to a human-readable, printable message.
383If
384.Fa preg
385is non-NULL,
386the error code should have arisen from use of
387the
388.Li regex_t
389pointed to by
390.Fa preg ,
391and if the error code came from
392.Fn regcomp ,
393it should have been the result from the most recent
394.Fn regcomp
395using that
396.Li regex_t .
397.Pf ( Fn regerror
398may be able to supply a more detailed message using information
399from the
400.Li regex_t . )
401.Fn regerror
402places the NUL-terminated message into the buffer pointed to by
403.Fa errbuf ,
404limiting the length (including the NUL) to at most
405.Fa errbuf_size
406bytes.
407If the whole message won't fit,
408as much of it as will fit before the terminating NUL is supplied.
409In any case,
410the returned value is the size of buffer needed to hold the whole
411message (including the terminating NUL).
412If
413.Fa errbuf_size
414is 0,
415.Fa errbuf
416is ignored but the return value is still correct.
417.Pp
418If the
419.Fa errcode
420given to
421.Fn regerror
422is first
423.Tn OR Ns 'ed
424with
425.Dv REG_ITOA ,
426the
427.Dq message
428that results is the printable name of the error code,
429e.g.,
430.Dq REG_NOMATCH ,
431rather than an explanation thereof.
432If
433.Fa errcode
434is
435.Dv REG_ATOI ,
436then
437.Fa preg
438shall be non-null and the
439.Fa re_endp
440member of the structure it points to
441must point to the printable name of an error code;
442in this case, the result in
443.Fa errbuf
444is the decimal digits of
445the numeric value of the error code
446(0 if the name is not recognized).
447.Dv REG_ITOA
448and
449.Dv REG_ATOI
450are intended primarily as debugging facilities;
451they are extensions,
452compatible with but not specified by
453.St -p1003.2
454and should be used with
455caution in software intended to be portable to other systems.
456Be warned also that they are considered experimental and changes are possible.
457.Pp
458.Fn regfree
459frees any dynamically allocated storage associated with the compiled RE
460pointed to by
461.Fa preg .
462The remaining
463.Li regex_t
464is no longer a valid compiled RE
465and the effect of supplying it to
466.Fn regexec
467or
468.Fn regerror
469is undefined.
470.Pp
471None of these functions references global variables except for tables
472of constants;
473all are safe for use from multiple threads if the arguments are safe.
474.Sh IMPLEMENTATION CHOICES
475There are a number of decisions that
476.St -p1003.2
477leaves up to the implementor,
478either by explicitly saying
479.Dq undefined
480or by virtue of them being
481forbidden by the RE grammar.
482This implementation treats them as follows.
483.Pp
484See
485.Xr re_format 7
486for a discussion of the definition of case-independent matching.
487.Pp
488There is no particular limit on the length of REs,
489except insofar as memory is limited.
490Memory usage is approximately linear in RE size, and largely insensitive
491to RE complexity, except for bounded repetitions.
492See
493.Sx BUGS
494for one short RE using them
495that will run almost any system out of memory.
496.Pp
497A backslashed character other than one specifically given a magic meaning
498by
499.St -p1003.2
500(such magic meanings occur only in obsolete REs)
501is taken as an ordinary character.
502.Pp
503Any unmatched
504.Ql \&[
505is a
506.Dv REG_EBRACK
507error.
508.Pp
509Equivalence classes cannot begin or end bracket-expression ranges.
510The endpoint of one range cannot begin another.
511.Pp
512RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
513.Pp
514A repetition operator (?, *, +, or bounds) cannot follow another
515repetition operator.
516A repetition operator cannot begin an expression or subexpression
517or follow
518.Ql ^
519or
520.Ql | .
521.Pp
522A
523.Ql |
524cannot appear first or last in a (sub)expression, or after another
525.Ql | ,
526i.e., an operand of
527.Ql |
528cannot be an empty subexpression.
529An empty parenthesized subexpression,
530.Ql \&(\&) ,
531is legal and matches an
532empty (sub)string.
533An empty string is not a legal RE.
534.Pp
535A
536.Ql {
537followed by a digit is considered the beginning of bounds for a
538bounded repetition, which must then follow the syntax for bounds.
539A
540.Ql {
541.Em not
542followed by a digit is considered an ordinary character.
543.Pp
544.Ql ^
545and
546.Ql $
547beginning and ending subexpressions in obsolete
548.Pq Dq basic
549REs are anchors, not ordinary characters.
550.Sh DIAGNOSTICS
551Non-zero error codes from
552.Fn regcomp
553and
554.Fn regexec
555include the following:
556.Pp
557.Bl -tag -compact -width XREG_ECOLLATEX
558.It Er REG_NOMATCH
559regexec() failed to match
560.It Er REG_BADPAT
561invalid regular expression
562.It Er REG_ECOLLATE
563invalid collating element
564.It Er REG_ECTYPE
565invalid character class
566.It Er REG_EESCAPE
567\e applied to unescapable character
568.It Er REG_ESUBREG
569invalid backreference number
570.It Er REG_EBRACK
571brackets [ ] not balanced
572.It Er REG_EPAREN
573parentheses ( ) not balanced
574.It Er REG_EBRACE
575braces { } not balanced
576.It Er REG_BADBR
577invalid repetition count(s) in { }
578.It Er REG_ERANGE
579invalid character range in [ ]
580.It Er REG_ESPACE
581ran out of memory
582.It Er REG_BADRPT
583?, *, or + operand invalid
584.It Er REG_EMPTY
585empty (sub)expression
586.It Er REG_ASSERT
587.Dq can't happen
588\(emyou found a bug
589.It Er REG_INVARG
590invalid argument, e.g., negative-length string
591.El
592.Sh SEE ALSO
593.Xr grep 1 ,
594.Xr re_format 7
595.Pp
596.St -p1003.2 ,
597sections 2.8 (Regular Expression Notation)
598and
599B.5 (C Binding for Regular Expression Matching).
600.Sh HISTORY
601Originally written by Henry Spencer.
602Altered for inclusion in the
603.Bx 4.4
604distribution.
605.Sh BUGS
606This is an alpha release with known defects.
607Please report problems.
608.Pp
609There is one known functionality bug.
610The implementation of internationalization is incomplete:
611the locale is always assumed to be the default one of
612.St -p1003.2 ,
613and only the collating elements etc. of that locale are available.
614.Pp
615The back-reference code is subtle and doubts linger about its correctness
616in complex cases.
617.Pp
618.Fn regexec
619performance is poor.
620This will improve with later releases.
621.Fa nmatch
622exceeding 0 is expensive;
623.Fa nmatch
624exceeding 1 is worse.
625.Fn regexec
626is largely insensitive to RE complexity
627.Em except
628that back references are massively expensive.
629RE length does matter; in particular, there is a strong speed bonus
630for keeping RE length under about 30 characters,
631with most special characters counting roughly double.
632.Pp
633.Fn regcomp
634implements bounded repetitions by macro expansion,
635which is costly in time and space if counts are large
636or bounded repetitions are nested.
637A RE like, say,
638.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
639will (eventually) run almost any existing machine out of swap space.
640.Pp
641There are suspected problems with response to obscure error conditions.
642Notably,
643certain kinds of internal overflow,
644produced only by truly enormous REs or by multiply nested bounded repetitions,
645are probably not handled well.
646.Pp
647Due to a mistake in
648.St -p1003.2 ,
649things like
650.Ql a)b
651are legal REs because
652.Ql \&)
653is
654a special character only in the presence of a previous unmatched
655.Ql \&( .
656This can't be fixed until the spec is fixed.
657.Pp
658The standard's definition of back references is vague.
659For example, does
660.Dq a\e(\e(b\e)*\e2\e)*d
661match
662.Dq abbbd ?
663Until the standard is clarified,
664behavior in such cases should not be relied on.
665.Pp
666The implementation of word-boundary matching is a bit of a kludge,
667and bugs may lurk in combinations of word-boundary matching and anchoring.
668