xref: /openbsd-src/lib/libc/regex/regex.3 (revision d1ce60253b51ed54cf3ad21b2215bd04afa99b19)
1.\"	$OpenBSD: regex.3,v 1.16 2002/10/16 21:32:06 wcobb Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. All advertising materials mentioning features or use of this software
21.\"    must display the following acknowledgement:
22.\"	This product includes software developed by the University of
23.\"	California, Berkeley and its contributors.
24.\" 4. Neither the name of the University nor the names of its contributors
25.\"    may be used to endorse or promote products derived from this software
26.\"    without specific prior written permission.
27.\"
28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
31.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
38.\" SUCH DAMAGE.
39.\"
40.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
41.\"
42.Dd March 20, 1994
43.Dt REGEX 3
44.Os
45.Sh NAME
46.Nm regcomp ,
47.Nm regexec ,
48.Nm regerror ,
49.Nm regfree
50.Nd regular-expression library
51.Sh SYNOPSIS
52.Fd #include <sys/types.h>
53.Fd #include <regex.h>
54.Ft int
55.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
56.Pp
57.Ft int
58.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \
59            "regmatch_t pmatch[]" "int eflags"
60.Pp
61.Ft size_t
62.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \
63             "size_t errbuf_size"
64.Pp
65.Ft void
66.Fn regfree "regex_t *preg"
67.Sh DESCRIPTION
68These routines implement
69.St -p1003.2
70regular expressions
71.Pq Dq REs ;
72see
73.Xr re_format 7 .
74.Fn regcomp
75compiles an RE written as a string into an internal form,
76.Fn regexec
77matches that internal form against a string and reports results,
78.Fn regerror
79transforms error codes from either into human-readable messages, and
80.Fn regfree
81frees any dynamically allocated storage used by the internal form
82of an RE.
83.Pp
84The header
85.Aq Pa regex.h
86declares two structure types,
87.Li regex_t
88and
89.Li regmatch_t ,
90the former for compiled internal forms and the latter for match reporting.
91It also declares the four functions,
92a type
93.Li regoff_t ,
94and a number of constants with names starting with
95.Dv REG_ .
96.Pp
97.Fn regcomp
98compiles the regular expression contained in the
99.Fa pattern
100string,
101subject to the flags in
102.Fa cflags ,
103and places the results in the
104.Li regex_t
105structure pointed to by
106.Fa preg .
107.Fa cflags
108is the bitwise
109.Tn OR
110of zero or more of the following flags:
111.Bl -tag -width XREG_EXTENDEDX
112.It Dv REG_EXTENDED
113Compile modern
114.Pq Dq extended
115REs,
116rather than the obsolete
117.Pq Dq basic
118REs that are the default.
119.It Dv REG_BASIC
120This is a synonym for 0,
121provided as a counterpart to
122.Dv REG_EXTENDED
123to improve readability.
124.It Dv REG_NOSPEC
125Compile with recognition of all special characters turned off.
126All characters are thus considered ordinary,
127so the RE is a literal string.
128This is an extension,
129compatible with but not specified by
130.St -p1003.2 ,
131and should be used with
132caution in software intended to be portable to other systems.
133.Dv REG_EXTENDED
134and
135.Dv REG_NOSPEC
136may not be used in the same call to
137.Fn regcomp .
138.It Dv REG_ICASE
139Compile for matching that ignores upper/lower case distinctions.
140See
141.Xr re_format 7 .
142.It Dv REG_NOSUB
143Compile for matching that need only report success or failure,
144not what was matched.
145.It Dv REG_NEWLINE
146Compile for newline-sensitive matching.
147By default, newline is a completely ordinary character with no special
148meaning in either REs or strings.
149With this flag,
150.Ql \&[^
151bracket expressions and
152.Ql \&.
153never match newline,
154a
155.Ql ^
156anchor matches the null string after any newline in the string
157in addition to its normal function,
158and the
159.Ql $
160anchor matches the null string before any newline in the
161string in addition to its normal function.
162.It Dv REG_PEND
163The regular expression ends,
164not at the first NUL,
165but just before the character pointed to by the
166.Fa re_endp
167member of the structure pointed to by
168.Fa preg .
169The
170.Fa re_endp
171member is of type
172.Fa const\ char\ * .
173This flag permits inclusion of NULs in the RE;
174they are considered ordinary characters.
175This is an extension,
176compatible with but not specified by
177.St -p1003.2 ,
178and should be used with
179caution in software intended to be portable to other systems.
180.El
181.Pp
182When successful,
183.Fn regcomp
184returns 0 and fills in the structure pointed to by
185.Fa preg .
186One member of that structure
187(other than
188.Fa re_endp )
189is publicized:
190.Fa re_nsub ,
191of type
192.Fa size_t ,
193contains the number of parenthesized subexpressions within the RE
194(except that the value of this member is undefined if the
195.Dv REG_NOSUB
196flag was used).
197If
198.Fn regcomp
199fails, it returns a non-zero error code;
200see DIAGNOSTICS.
201.Pp
202.Fn regexec
203matches the compiled RE pointed to by
204.Fa preg
205against the
206.Fa string ,
207subject to the flags in
208.Fa eflags ,
209and reports results using
210.Fa nmatch ,
211.Fa pmatch ,
212and the returned value.
213The RE must have been compiled by a previous invocation of
214.Fn regcomp .
215The compiled form is not altered during execution of
216.Fn regexec ,
217so a single compiled RE can be used simultaneously by multiple threads.
218.Pp
219By default,
220the null-terminated string pointed to by
221.Fa string
222is considered to be the text of an entire line, minus any terminating
223newline.
224The
225.Fa eflags
226argument is the bitwise
227.Tn OR
228of zero or more of the following flags:
229.Bl -tag -width XREG_STARTENDX
230.It Dv REG_NOTBOL
231The first character of
232the string
233is not the beginning of a line, so the
234.Ql ^
235anchor should not match before it.
236This does not affect the behavior of newlines under
237.Dv REG_NEWLINE .
238.It Dv REG_NOTEOL
239The NUL terminating
240the string
241does not end a line, so the
242.Ql $
243anchor should not match before it.
244This does not affect the behavior of newlines under
245.Dv REG_NEWLINE .
246.It Dv REG_STARTEND
247The string is considered to start at
248\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
249and to have a terminating NUL located at
250\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
251(there need not actually be a NUL at that location),
252regardless of the value of
253.Fa nmatch .
254See below for the definition of
255.Fa pmatch
256and
257.Fa nmatch .
258This is an extension,
259compatible with but not specified by
260.St -p1003.2 ,
261and should be used with
262caution in software intended to be portable to other systems.
263Note that a non-zero \fIrm_so\fR does not imply
264.Dv REG_NOTBOL ;
265.Dv REG_STARTEND
266affects only the location of the string,
267not how it is matched.
268.El
269.Pp
270See
271.Xr re_format 7
272for a discussion of what is matched in situations where an RE or a
273portion thereof could match any of several substrings of
274.Fa string .
275.Pp
276Normally,
277.Fn regexec
278returns 0 for success and the non-zero code
279.Dv REG_NOMATCH
280for failure.
281Other non-zero error codes may be returned in exceptional situations;
282see DIAGNOSTICS.
283.Pp
284If
285.Dv REG_NOSUB
286was specified in the compilation of the RE,
287or if
288.Fa nmatch
289is 0,
290.Fn regexec
291ignores the
292.Fa pmatch
293argument (but see below for the case where
294.Dv REG_STARTEND
295is specified).
296Otherwise,
297.Fa pmatch
298points to an array of
299.Fa nmatch
300structures of type
301.Li regmatch_t .
302Such a structure has at least the members
303.Fa rm_so
304and
305.Fa rm_eo ,
306both of type
307.Fa regoff_t
308(a signed arithmetic type at least as large as an
309.Li off_t
310and a
311.Li ssize_t ) ,
312containing respectively the offset of the first character of a substring
313and the offset of the first character after the end of the substring.
314Offsets are measured from the beginning of the
315.Fa string
316argument given to
317.Fn regexec .
318An empty substring is denoted by equal offsets,
319both indicating the character following the empty substring.
320.Pp
321The 0th member of the
322.Fa pmatch
323array is filled in to indicate what substring of
324.Fa string
325was matched by the entire RE.
326Remaining members report what substring was matched by parenthesized
327subexpressions within the RE;
328member
329.Va i
330reports subexpression
331.Va i ,
332with subexpressions counted (starting at 1) by the order of their opening
333parentheses in the RE, left to right.
334Unused entries in the array\(emcorresponding either to subexpressions that
335did not participate in the match at all, or to subexpressions that do not
336exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
337.Fa rm_so
338and
339.Fa rm_eo
340set to \-1.
341If a subexpression participated in the match several times,
342the reported substring is the last one it matched.
343(Note, as an example in particular, that when the RE
344.Dq (b*)+
345matches
346.Dq bbb ,
347the parenthesized subexpression matches each of the three
348.Sq b Ns s
349and then
350an infinite number of empty strings following the last
351.Sq b ,
352so the reported substring is one of the empties.)
353.Pp
354If
355.Dv REG_STARTEND
356is specified,
357.Fa pmatch
358must point to at least one
359.Li regmatch_t
360(even if
361.Fa nmatch
362is 0 or
363.Dv REG_NOSUB
364was specified),
365to hold the input offsets for
366.Dv REG_STARTEND .
367Use for output is still entirely controlled by
368.Fa nmatch ;
369if
370.Fa nmatch
371is 0 or
372.Dv REG_NOSUB
373was specified,
374the value of
375.Fa pmatch[0]
376will not be changed by a successful
377.Fn regexec .
378.Pp
379.Fn regerror
380maps a non-zero
381.Va errcode
382from either
383.Fn regcomp
384or
385.Fn regexec
386to a human-readable, printable message.
387If
388.Fa preg
389is non-NULL,
390the error code should have arisen from use of
391the
392.Li regex_t
393pointed to by
394.Fa preg ,
395and if the error code came from
396.Fn regcomp ,
397it should have been the result from the most recent
398.Fn regcomp
399using that
400.Li regex_t .
401.Pf ( Fn regerror
402may be able to supply a more detailed message using information
403from the
404.Li regex_t . )
405.Fn regerror
406places the null-terminated message into the buffer pointed to by
407.Fa errbuf ,
408limiting the length (including the NUL) to at most
409.Fa errbuf_size
410bytes.
411If the whole message won't fit,
412as much of it as will fit before the terminating NUL is supplied.
413In any case,
414the returned value is the size of buffer needed to hold the whole
415message (including the terminating NUL).
416If
417.Fa errbuf_size
418is 0,
419.Fa errbuf
420is ignored but the return value is still correct.
421.Pp
422If the
423.Fa errcode
424given to
425.Fn regerror
426is first
427.Tn OR Ns 'ed
428with
429.Dv REG_ITOA ,
430the
431.Dq message
432that results is the printable name of the error code,
433e.g.,
434.Dq REG_NOMATCH ,
435rather than an explanation thereof.
436If
437.Fa errcode
438is
439.Dv REG_ATOI ,
440then
441.Fa preg
442shall be non-null and the
443.Fa re_endp
444member of the structure it points to
445must point to the printable name of an error code;
446in this case, the result in
447.Fa errbuf
448is the decimal digits of
449the numeric value of the error code
450(0 if the name is not recognized).
451.Dv REG_ITOA
452and
453.Dv REG_ATOI
454are intended primarily as debugging facilities;
455they are extensions,
456compatible with but not specified by
457.St -p1003.2
458and should be used with
459caution in software intended to be portable to other systems.
460Be warned also that they are considered experimental and changes are possible.
461.Pp
462.Fn regfree
463frees any dynamically allocated storage associated with the compiled RE
464pointed to by
465.Fa preg .
466The remaining
467.Li regex_t
468is no longer a valid compiled RE
469and the effect of supplying it to
470.Fn regexec
471or
472.Fn regerror
473is undefined.
474.Pp
475None of these functions references global variables except for tables
476of constants;
477all are safe for use from multiple threads if the arguments are safe.
478.Sh IMPLEMENTATION CHOICES
479There are a number of decisions that
480.St -p1003.2
481leaves up to the implementor,
482either by explicitly saying
483.Dq undefined
484or by virtue of them being
485forbidden by the RE grammar.
486This implementation treats them as follows.
487.Pp
488See
489.Xr re_format 7
490for a discussion of the definition of case-independent matching.
491.Pp
492There is no particular limit on the length of REs,
493except insofar as memory is limited.
494Memory usage is approximately linear in RE size, and largely insensitive
495to RE complexity, except for bounded repetitions.
496See
497.Sx BUGS
498for one short RE using them
499that will run almost any system out of memory.
500.Pp
501A backslashed character other than one specifically given a magic meaning
502by
503.St -p1003.2
504(such magic meanings occur only in obsolete REs)
505is taken as an ordinary character.
506.Pp
507Any unmatched
508.Ql \&[
509is a
510.Dv REG_EBRACK
511error.
512.Pp
513Equivalence classes cannot begin or end bracket-expression ranges.
514The endpoint of one range cannot begin another.
515.Pp
516RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
517.Pp
518A repetition operator (?, *, +, or bounds) cannot follow another
519repetition operator.
520A repetition operator cannot begin an expression or subexpression
521or follow
522.Ql ^
523or
524.Ql | .
525.Pp
526A
527.Ql |
528cannot appear first or last in a (sub)expression, or after another
529.Ql | ,
530i.e., an operand of
531.Ql |
532cannot be an empty subexpression.
533An empty parenthesized subexpression,
534.Ql \&(\&) ,
535is legal and matches an
536empty (sub)string.
537An empty string is not a legal RE.
538.Pp
539A
540.Ql {
541followed by a digit is considered the beginning of bounds for a
542bounded repetition, which must then follow the syntax for bounds.
543A
544.Ql {
545.Em not
546followed by a digit is considered an ordinary character.
547.Pp
548.Ql ^
549and
550.Ql $
551beginning and ending subexpressions in obsolete
552.Pq Dq basic
553REs are anchors, not ordinary characters.
554.Sh SEE ALSO
555.Xr grep 1 ,
556.Xr re_format 7
557.Pp
558.St -p1003.2 ,
559sections 2.8 (Regular Expression Notation)
560and
561B.5 (C Binding for Regular Expression Matching).
562.Sh DIAGNOSTICS
563Non-zero error codes from
564.Fn regcomp
565and
566.Fn regexec
567include the following:
568.Pp
569.Bl -tag -compact -width XREG_ECOLLATEX
570.It Er REG_NOMATCH
571regexec() failed to match
572.It Er REG_BADPAT
573invalid regular expression
574.It Er REG_ECOLLATE
575invalid collating element
576.It Er REG_ECTYPE
577invalid character class
578.It Er REG_EESCAPE
579\e applied to unescapable character
580.It Er REG_ESUBREG
581invalid backreference number
582.It Er REG_EBRACK
583brackets [ ] not balanced
584.It Er REG_EPAREN
585parentheses ( ) not balanced
586.It Er REG_EBRACE
587braces { } not balanced
588.It Er REG_BADBR
589invalid repetition count(s) in { }
590.It Er REG_ERANGE
591invalid character range in [ ]
592.It Er REG_ESPACE
593ran out of memory
594.It Er REG_BADRPT
595?, *, or + operand invalid
596.It Er REG_EMPTY
597empty (sub)expression
598.It Er REG_ASSERT
599.Dq can't happen
600\(emyou found a bug
601.It Er REG_INVARG
602invalid argument, e.g., negative-length string
603.El
604.Sh HISTORY
605Originally written by Henry Spencer.
606Altered for inclusion in the
607.Bx 4.4
608distribution.
609.Sh BUGS
610This is an alpha release with known defects.
611Please report problems.
612.Pp
613There is one known functionality bug.
614The implementation of internationalization is incomplete:
615the locale is always assumed to be the default one of
616.St -p1003.2 ,
617and only the collating elements etc. of that locale are available.
618.Pp
619The back-reference code is subtle and doubts linger about its correctness
620in complex cases.
621.Pp
622.Fn regexec
623performance is poor.
624This will improve with later releases.
625.Fa nmatch
626exceeding 0 is expensive;
627.Fa nmatch
628exceeding 1 is worse.
629.Fn regexec
630is largely insensitive to RE complexity
631.Em except
632that back references are massively expensive.
633RE length does matter; in particular, there is a strong speed bonus
634for keeping RE length under about 30 characters,
635with most special characters counting roughly double.
636.Pp
637.Fn regcomp
638implements bounded repetitions by macro expansion,
639which is costly in time and space if counts are large
640or bounded repetitions are nested.
641A RE like, say,
642.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
643will (eventually) run almost any existing machine out of swap space.
644.Pp
645There are suspected problems with response to obscure error conditions.
646Notably,
647certain kinds of internal overflow,
648produced only by truly enormous REs or by multiply nested bounded repetitions,
649are probably not handled well.
650.Pp
651Due to a mistake in
652.St -p1003.2 ,
653things like
654.Ql a)b
655are legal REs because
656.Ql \&)
657is
658a special character only in the presence of a previous unmatched
659.Ql \&( .
660This can't be fixed until the spec is fixed.
661.Pp
662The standard's definition of back references is vague.
663For example, does
664.Dq a\e(\e(b\e)*\e2\e)*d
665match
666.Dq abbbd ?
667Until the standard is clarified,
668behavior in such cases should not be relied on.
669.Pp
670The implementation of word-boundary matching is a bit of a kludge,
671and bugs may lurk in combinations of word-boundary matching and anchoring.
672