xref: /openbsd-src/lib/libc/regex/regex.3 (revision 2777ee89d0e541ec819d05abee114837837abbec)
1.\"	$OpenBSD: regex.3,v 1.26 2015/11/10 23:48:18 jmc Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
37.\"
38.Dd $Mdocdate: November 10 2015 $
39.Dt REGEXEC 3
40.Os
41.Sh NAME
42.Nm regcomp ,
43.Nm regexec ,
44.Nm regerror ,
45.Nm regfree
46.Nd regular expression routines
47.Sh SYNOPSIS
48.In sys/types.h
49.In regex.h
50.Ft int
51.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
52.Pp
53.Ft int
54.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \
55            "regmatch_t pmatch[]" "int eflags"
56.Pp
57.Ft size_t
58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \
59             "size_t errbuf_size"
60.Pp
61.Ft void
62.Fn regfree "regex_t *preg"
63.Sh DESCRIPTION
64These routines implement
65.St -p1003.2
66regular expressions
67.Pq Dq REs ;
68see
69.Xr re_format 7 .
70.Fn regcomp
71compiles an RE written as a string into an internal form,
72.Fn regexec
73matches that internal form against a string and reports results,
74.Fn regerror
75transforms error codes from either into human-readable messages, and
76.Fn regfree
77frees any dynamically allocated storage used by the internal form
78of an RE.
79.Pp
80The header
81.In regex.h
82declares two structure types,
83.Li regex_t
84and
85.Li regmatch_t ,
86the former for compiled internal forms and the latter for match reporting.
87It also declares the four functions,
88a type
89.Li regoff_t ,
90and a number of constants with names starting with
91.Dv REG_ .
92.Pp
93.Fn regcomp
94compiles the regular expression contained in the
95.Fa pattern
96string,
97subject to the flags in
98.Fa cflags ,
99and places the results in the
100.Li regex_t
101structure pointed to by
102.Fa preg .
103The
104.Fa cflags
105argument is the bitwise
106.Tn OR
107of zero or more of the following values:
108.Bl -tag -width XREG_EXTENDEDX
109.It Dv REG_EXTENDED
110Compile modern
111.Pq Dq extended
112REs,
113rather than the obsolete
114.Pq Dq basic
115REs that are the default.
116.It Dv REG_BASIC
117This is a synonym for 0,
118provided as a counterpart to
119.Dv REG_EXTENDED
120to improve readability.
121.It Dv REG_NOSPEC
122Compile with recognition of all special characters turned off.
123All characters are thus considered ordinary,
124so the RE is a literal string.
125This is an extension,
126compatible with but not specified by
127.St -p1003.2 ,
128and should be used with
129caution in software intended to be portable to other systems.
130.Dv REG_EXTENDED
131and
132.Dv REG_NOSPEC
133may not be used in the same call to
134.Fn regcomp .
135.It Dv REG_ICASE
136Compile for matching that ignores upper/lower case distinctions.
137See
138.Xr re_format 7 .
139.It Dv REG_NOSUB
140Compile for matching that need only report success or failure,
141not what was matched.
142.It Dv REG_NEWLINE
143Compile for newline-sensitive matching.
144By default, newline is a completely ordinary character with no special
145meaning in either REs or strings.
146With this flag,
147.Ql \&[^
148bracket expressions and
149.Ql \&.
150never match newline,
151a
152.Ql ^
153anchor matches the null string after any newline in the string
154in addition to its normal function,
155and the
156.Ql $
157anchor matches the null string before any newline in the
158string in addition to its normal function.
159.It Dv REG_PEND
160The regular expression ends,
161not at the first NUL,
162but just before the character pointed to by the
163.Fa re_endp
164member of the structure pointed to by
165.Fa preg .
166The
167.Fa re_endp
168member is of type
169.Fa const\ char\ * .
170This flag permits inclusion of NULs in the RE;
171they are considered ordinary characters.
172This is an extension,
173compatible with but not specified by
174.St -p1003.2 ,
175and should be used with
176caution in software intended to be portable to other systems.
177.El
178.Pp
179When successful,
180.Fn regcomp
181returns 0 and fills in the structure pointed to by
182.Fa preg .
183One member of that structure
184(other than
185.Fa re_endp )
186is publicized:
187.Fa re_nsub ,
188of type
189.Fa size_t ,
190contains the number of parenthesized subexpressions within the RE
191(except that the value of this member is undefined if the
192.Dv REG_NOSUB
193flag was used).
194If
195.Fn regcomp
196fails, it returns a non-zero error code;
197see DIAGNOSTICS.
198.Pp
199.Fn regexec
200matches the compiled RE pointed to by
201.Fa preg
202against the
203.Fa string ,
204subject to the flags in
205.Fa eflags ,
206and reports results using
207.Fa nmatch ,
208.Fa pmatch ,
209and the returned value.
210The RE must have been compiled by a previous invocation of
211.Fn regcomp .
212The compiled form is not altered during execution of
213.Fn regexec ,
214so a single compiled RE can be used simultaneously by multiple threads.
215.Pp
216By default,
217the NUL-terminated string pointed to by
218.Fa string
219is considered to be the text of an entire line, minus any terminating
220newline.
221The
222.Fa eflags
223argument is the bitwise
224.Tn OR
225of zero or more of the following values:
226.Bl -tag -width XREG_STARTENDX
227.It Dv REG_NOTBOL
228The first character of
229the string
230is not the beginning of a line, so the
231.Ql ^
232anchor should not match before it.
233This does not affect the behavior of newlines under
234.Dv REG_NEWLINE .
235.It Dv REG_NOTEOL
236The NUL terminating
237the string
238does not end a line, so the
239.Ql $
240anchor should not match before it.
241This does not affect the behavior of newlines under
242.Dv REG_NEWLINE .
243.It Dv REG_STARTEND
244The string is considered to start at
245\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
246and to have a terminating NUL located at
247\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
248(there need not actually be a NUL at that location),
249regardless of the value of
250.Fa nmatch .
251See below for the definition of
252.Fa pmatch
253and
254.Fa nmatch .
255This is an extension,
256compatible with but not specified by
257.St -p1003.2 ,
258and should be used with
259caution in software intended to be portable to other systems.
260Note that a non-zero \fIrm_so\fR does not imply
261.Dv REG_NOTBOL ;
262.Dv REG_STARTEND
263affects only the location of the string,
264not how it is matched.
265.El
266.Pp
267See
268.Xr re_format 7
269for a discussion of what is matched in situations where an RE or a
270portion thereof could match any of several substrings of
271.Fa string .
272.Pp
273Normally,
274.Fn regexec
275returns 0 for success and the non-zero code
276.Dv REG_NOMATCH
277for failure.
278Other non-zero error codes may be returned in exceptional situations;
279see DIAGNOSTICS.
280.Pp
281If
282.Dv REG_NOSUB
283was specified in the compilation of the RE,
284or if
285.Fa nmatch
286is 0,
287.Fn regexec
288ignores the
289.Fa pmatch
290argument (but see below for the case where
291.Dv REG_STARTEND
292is specified).
293Otherwise,
294.Fa pmatch
295points to an array of
296.Fa nmatch
297structures of type
298.Li regmatch_t .
299Such a structure has at least the members
300.Fa rm_so
301and
302.Fa rm_eo ,
303both of type
304.Fa regoff_t
305(a signed arithmetic type at least as large as an
306.Li off_t
307and a
308.Li ssize_t ) ,
309containing respectively the offset of the first character of a substring
310and the offset of the first character after the end of the substring.
311Offsets are measured from the beginning of the
312.Fa string
313argument given to
314.Fn regexec .
315An empty substring is denoted by equal offsets,
316both indicating the character following the empty substring.
317.Pp
318The 0th member of the
319.Fa pmatch
320array is filled in to indicate what substring of
321.Fa string
322was matched by the entire RE.
323Remaining members report what substring was matched by parenthesized
324subexpressions within the RE;
325member
326.Va i
327reports subexpression
328.Va i ,
329with subexpressions counted (starting at 1) by the order of their opening
330parentheses in the RE, left to right.
331Unused entries in the array\(emcorresponding either to subexpressions that
332did not participate in the match at all, or to subexpressions that do not
333exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
334.Fa rm_so
335and
336.Fa rm_eo
337set to \-1.
338If a subexpression participated in the match several times,
339the reported substring is the last one it matched.
340(Note, as an example in particular, that when the RE
341.Dq (b*)+
342matches
343.Dq bbb ,
344the parenthesized subexpression matches each of the three
345.Sq b Ns s
346and then
347an infinite number of empty strings following the last
348.Sq b ,
349so the reported substring is one of the empties.)
350.Pp
351If
352.Dv REG_STARTEND
353is specified,
354.Fa pmatch
355must point to at least one
356.Li regmatch_t
357(even if
358.Fa nmatch
359is 0 or
360.Dv REG_NOSUB
361was specified),
362to hold the input offsets for
363.Dv REG_STARTEND .
364Use for output is still entirely controlled by
365.Fa nmatch ;
366if
367.Fa nmatch
368is 0 or
369.Dv REG_NOSUB
370was specified,
371the value of
372.Fa pmatch[0]
373will not be changed by a successful
374.Fn regexec .
375.Pp
376.Fn regerror
377maps a non-zero
378.Va errcode
379from either
380.Fn regcomp
381or
382.Fn regexec
383to a human-readable, printable message.
384If
385.Fa preg
386is non-NULL,
387the error code should have arisen from use of
388the
389.Li regex_t
390pointed to by
391.Fa preg ,
392and if the error code came from
393.Fn regcomp ,
394it should have been the result from the most recent
395.Fn regcomp
396using that
397.Li regex_t .
398.Pf ( Fn regerror
399may be able to supply a more detailed message using information
400from the
401.Li regex_t . )
402.Fn regerror
403places the NUL-terminated message into the buffer pointed to by
404.Fa errbuf ,
405limiting the length (including the NUL) to at most
406.Fa errbuf_size
407bytes.
408If the whole message won't fit,
409as much of it as will fit before the terminating NUL is supplied.
410In any case,
411the returned value is the size of buffer needed to hold the whole
412message (including the terminating NUL).
413If
414.Fa errbuf_size
415is 0,
416.Fa errbuf
417is ignored but the return value is still correct.
418.Pp
419If the
420.Fa errcode
421given to
422.Fn regerror
423is first
424.Tn OR Ns 'ed
425with
426.Dv REG_ITOA ,
427the
428.Dq message
429that results is the printable name of the error code,
430e.g.,
431.Dq REG_NOMATCH ,
432rather than an explanation thereof.
433If
434.Fa errcode
435is
436.Dv REG_ATOI ,
437then
438.Fa preg
439shall be non-null and the
440.Fa re_endp
441member of the structure it points to
442must point to the printable name of an error code;
443in this case, the result in
444.Fa errbuf
445is the decimal digits of
446the numeric value of the error code
447(0 if the name is not recognized).
448.Dv REG_ITOA
449and
450.Dv REG_ATOI
451are intended primarily as debugging facilities;
452they are extensions,
453compatible with but not specified by
454.St -p1003.2
455and should be used with
456caution in software intended to be portable to other systems.
457Be warned also that they are considered experimental and changes are possible.
458.Pp
459.Fn regfree
460frees any dynamically allocated storage associated with the compiled RE
461pointed to by
462.Fa preg .
463The remaining
464.Li regex_t
465is no longer a valid compiled RE
466and the effect of supplying it to
467.Fn regexec
468or
469.Fn regerror
470is undefined.
471.Pp
472None of these functions references global variables except for tables
473of constants;
474all are safe for use from multiple threads if the arguments are safe.
475.Sh IMPLEMENTATION CHOICES
476There are a number of decisions that
477.St -p1003.2
478leaves up to the implementor,
479either by explicitly saying
480.Dq undefined
481or by virtue of them being
482forbidden by the RE grammar.
483This implementation treats them as follows.
484.Pp
485See
486.Xr re_format 7
487for a discussion of the definition of case-independent matching.
488.Pp
489There is no particular limit on the length of REs,
490except insofar as memory is limited.
491Memory usage is approximately linear in RE size, and largely insensitive
492to RE complexity, except for bounded repetitions.
493See
494.Sx BUGS
495for one short RE using them
496that will run almost any system out of memory.
497.Pp
498A backslashed character other than one specifically given a magic meaning
499by
500.St -p1003.2
501(such magic meanings occur only in obsolete REs)
502is taken as an ordinary character.
503.Pp
504Any unmatched
505.Ql \&[
506is a
507.Dv REG_EBRACK
508error.
509.Pp
510Equivalence classes cannot begin or end bracket-expression ranges.
511The endpoint of one range cannot begin another.
512.Pp
513RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
514.Pp
515A repetition operator (?, *, +, or bounds) cannot follow another
516repetition operator.
517A repetition operator cannot begin an expression or subexpression
518or follow
519.Ql ^
520or
521.Ql | .
522.Pp
523A
524.Ql |
525cannot appear first or last in a (sub)expression, or after another
526.Ql | ,
527i.e., an operand of
528.Ql |
529cannot be an empty subexpression.
530An empty parenthesized subexpression,
531.Ql \&(\&) ,
532is legal and matches an
533empty (sub)string.
534An empty string is not a legal RE.
535.Pp
536A
537.Ql {
538followed by a digit is considered the beginning of bounds for a
539bounded repetition, which must then follow the syntax for bounds.
540A
541.Ql {
542.Em not
543followed by a digit is considered an ordinary character.
544.Pp
545.Ql ^
546and
547.Ql $
548beginning and ending subexpressions in obsolete
549.Pq Dq basic
550REs are anchors, not ordinary characters.
551.Sh DIAGNOSTICS
552Non-zero error codes from
553.Fn regcomp
554and
555.Fn regexec
556include the following:
557.Pp
558.Bl -tag -compact -width XREG_ECOLLATEX
559.It Er REG_NOMATCH
560regexec() failed to match
561.It Er REG_BADPAT
562invalid regular expression
563.It Er REG_ECOLLATE
564invalid collating element
565.It Er REG_ECTYPE
566invalid character class
567.It Er REG_EESCAPE
568\e applied to unescapable character
569.It Er REG_ESUBREG
570invalid backreference number
571.It Er REG_EBRACK
572brackets [ ] not balanced
573.It Er REG_EPAREN
574parentheses ( ) not balanced
575.It Er REG_EBRACE
576braces { } not balanced
577.It Er REG_BADBR
578invalid repetition count(s) in { }
579.It Er REG_ERANGE
580invalid character range in [ ]
581.It Er REG_ESPACE
582ran out of memory
583.It Er REG_BADRPT
584?, *, or + operand invalid
585.It Er REG_EMPTY
586empty (sub)expression
587.It Er REG_ASSERT
588.Dq can't happen
589\(emyou found a bug
590.It Er REG_INVARG
591invalid argument, e.g., negative-length string
592.El
593.Sh SEE ALSO
594.Xr grep 1 ,
595.Xr re_format 7
596.Pp
597.St -p1003.2 ,
598sections 2.8 (Regular Expression Notation)
599and
600B.5 (C Binding for Regular Expression Matching).
601.Sh HISTORY
602Predecessors to
603.Fn regcomp
604and
605.Fn regexec ,
606.Fn regcmp
607and
608.Fn regex ,
609first appeared in the Programmer's Workbench (PWB/UNIX).
610.Pp
611The present functions were originally written by Henry Spencer
612and altered for inclusion in the
613.Bx 4.4
614distribution.
615.Sh BUGS
616This is an alpha release with known defects.
617Please report problems.
618.Pp
619There is one known functionality bug.
620The implementation of internationalization is incomplete:
621the locale is always assumed to be the default one of
622.St -p1003.2 ,
623and only the collating elements etc. of that locale are available.
624.Pp
625The back-reference code is subtle and doubts linger about its correctness
626in complex cases.
627.Pp
628.Fn regexec
629performance is poor.
630This will improve with later releases.
631.Fa nmatch
632exceeding 0 is expensive;
633.Fa nmatch
634exceeding 1 is worse.
635.Fn regexec
636is largely insensitive to RE complexity
637.Em except
638that back references are massively expensive.
639RE length does matter; in particular, there is a strong speed bonus
640for keeping RE length under about 30 characters,
641with most special characters counting roughly double.
642.Pp
643.Fn regcomp
644implements bounded repetitions by macro expansion,
645which is costly in time and space if counts are large
646or bounded repetitions are nested.
647A RE like, say,
648.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
649will (eventually) run almost any existing machine out of swap space.
650.Pp
651There are suspected problems with response to obscure error conditions.
652Notably,
653certain kinds of internal overflow,
654produced only by truly enormous REs or by multiply nested bounded repetitions,
655are probably not handled well.
656.Pp
657Due to a mistake in
658.St -p1003.2 ,
659things like
660.Ql a)b
661are legal REs because
662.Ql \&)
663is
664a special character only in the presence of a previous unmatched
665.Ql \&( .
666This can't be fixed until the spec is fixed.
667.Pp
668The standard's definition of back references is vague.
669For example, does
670.Dq a\e(\e(b\e)*\e2\e)*d
671match
672.Dq abbbd ?
673Until the standard is clarified,
674behavior in such cases should not be relied on.
675.Pp
676The implementation of word-boundary matching is a bit of a kludge,
677and bugs may lurk in combinations of word-boundary matching and anchoring.
678