.\"	$OpenBSD: regex.3,v 1.30 2022/09/11 06:38:10 jmc Exp $
.\"
.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\"
.Dd $Mdocdate: September 11 2022 $
.Dt REGEXEC 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular expression routines
.Sh SYNOPSIS
.In sys/types.h
.In regex.h
.Ft int
.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
.Pp
.Ft int
.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \
            "regmatch_t pmatch[]" "int eflags"
.Pp
.Ft size_t
.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \
             "size_t errbuf_size"
.Pp
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Dq REs ;
see
.Xr re_format 7 .
.Fn regcomp
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages, and
.Fn regfree
frees any dynamically allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Vt regex_t
and
.Vt regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Vt regoff_t ,
and a number of constants with names starting with
.Dv REG_ .
.Pp
.Fn regcomp
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Vt regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument is the bitwise OR of zero or more of the following values:
.Bl -tag -width XREG_EXTENDEDX
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the RE is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql \&[^
bracket expressions and
.Ql \&.
never match newline,
a
.Ql ^
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Fa re_endp
member of the structure pointed to by
.Fa preg .
The
.Fa re_endp
member is of type
.Fa const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
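.Pp
The effect of
.Dv REG_NEWLINE
can be illustrated with a minimal sketch.
The helper name
.Fn re_matches
is hypothetical and not part of this library;
it compiles with
.Dv REG_NOSUB
because only success or failure is wanted.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of regex.h):
 * returns 1 if pattern matches text, 0 if it does not,
 * and -1 if the pattern fails to compile.
 */
static int
re_matches(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is needed here. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

Under these assumptions, the pattern ^bar is expected to match a
two-line string whose second line is bar only when
.Dv REG_NEWLINE
is included in
.Fa cflags ,
while foo.bar is expected to match across the newline only when it is not.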
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Fa re_endp )
is publicized:
.Fa re_nsub ,
of type
.Fa size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
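.Pp
As a small sketch of the return convention,
the following compiles a pattern, reads
.Fa re_nsub ,
and releases the compiled form.
The function name
.Fn count_groups
is hypothetical and used only for illustration.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of regex.h):
 * returns the number of parenthesized subexpressions in an
 * extended RE, or -1 if regcomp() rejects the pattern.
 */
static long
count_groups(const char *pattern)
{
	regex_t re;
	long n;

	/* REG_NOSUB is deliberately not used: it would leave re_nsub undefined. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;	/* re does not hold a valid compiled RE */
	n = (long)re.re_nsub;
	regfree(&re);
	return n;
}
```

A caller that needs the reason for a failure would keep the non-zero
return value of
.Fn regcomp
and pass it to
.Fn regerror .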
.Pp
.Fn regexec
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following values:
.Bl -tag -width XREG_STARTENDX
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^ ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
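.Pp
One common use of
.Dv REG_NOTBOL
is resuming a search partway through a line.
The sketch below counts successive matches by advancing the string pointer
and setting
.Dv REG_NOTBOL
on every call after the first, so that
.Ql ^
cannot match at the resumed position.
The name
.Fn count_matches
is hypothetical, and the loop assumes a pattern that is not meant to match
the empty string; the empty-match branches exist only to guarantee progress.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of regex.h):
 * counts successive non-overlapping matches of an extended RE in text.
 * Returns -1 if the pattern does not compile.
 */
static long
count_matches(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m[1];
	long n = 0;
	int eflags = 0;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	while (regexec(&re, text, 1, m, eflags) == 0) {
		n++;
		if (m[0].rm_eo > 0)
			text += m[0].rm_eo;	/* resume after this match */
		else if (*text != 0)
			text++;			/* empty match: step one byte */
		else
			break;			/* empty match at end of string */
		/* What remains is the continuation of a line, not its start. */
		eflags = REG_NOTBOL;
	}
	regfree(&re);
	return n;
}
```

Because the pointer itself is advanced, the matcher cannot see the character
preceding the resumed position; where that context matters,
.Dv REG_STARTEND
as described above may be a better fit on systems that provide it.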
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Vt regmatch_t .
Such a structure has at least the members
.Fa rm_so
and
.Fa rm_eo ,
both of type
.Fa regoff_t
(a signed arithmetic type at least as large as an
.Vt off_t
and a
.Vt ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.Fa rm_so
and
.Fa rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Dq (b*)+
matches
.Dq bbb ,
the parenthesized subexpression matches each of the three
.Sq b Ns s
and then
an infinite number of empty strings following the last
.Sq b ,
so the reported substring is one of the empties.)
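.Pp
The way
.Fa pmatch
is filled in can be sketched as follows.
The helper
.Fn group_span
and the constant MAXGROUPS are hypothetical names chosen for this example;
the helper reports the offsets recorded for one entry of the array.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

#define MAXGROUPS 10	/* hypothetical limit for this sketch */

/*
 * Hypothetical helper (not part of regex.h):
 * matches an extended RE against text and stores the offsets of
 * pmatch entry i in *so and *eo.
 * Returns 0 on a match, 1 on no match, -1 on a bad pattern or index.
 */
static int
group_span(const char *pattern, const char *text, size_t i,
    long *so, long *eo)
{
	regex_t re;
	regmatch_t m[MAXGROUPS];
	int rc;

	if (i >= MAXGROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, MAXGROUPS, m, 0);
	if (rc == 0) {
		/* Entry 0 is the whole match; entry i is subexpression i. */
		*so = (long)m[i].rm_so;
		*eo = (long)m[i].rm_eo;
	}
	regfree(&re);
	return rc == 0 ? 0 : 1;
}
```

For a subexpression that takes no part in the match, such as the first
group of (a)|(b) when the string is b, both offsets are expected to be \-1.
The matched text itself is the
.Fa rm_eo
minus
.Fa rm_so
bytes starting at
.Fa string
plus
.Fa rm_so .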
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Vt regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch[0]
will not be changed by a successful
.Fn regexec .
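.Pp
A possible use of
.Dv REG_STARTEND
is searching a byte range of a larger buffer without copying it.
The sketch below is one way to do that; the name
.Fn search_range
is hypothetical.
Since the flag is an extension, the code falls back to matching a
NUL-terminated copy of the range when
.Dv REG_STARTEND
is not defined; that fallback is an assumption of this example, not
something the library provides.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of regex.h):
 * searches bytes [from, to) of text for an extended RE.
 * Returns the offset of the match from the start of text, or -1.
 */
static long
search_range(const char *pattern, const char *text, size_t from, size_t to)
{
	regex_t re;
	regmatch_t m[1];
	long pos = -1;

	if (from > to || regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
#ifdef REG_STARTEND
	/* pmatch[0] carries the input range and then receives the result. */
	m[0].rm_so = (regoff_t)from;
	m[0].rm_eo = (regoff_t)to;
	if (regexec(&re, text, 1, m, REG_STARTEND) == 0)
		pos = (long)m[0].rm_so;
#else
	/* Portable fallback: match against a NUL-terminated copy. */
	char *copy = malloc(to - from + 1);

	if (copy != NULL) {
		memcpy(copy, text + from, to - from);
		copy[to - from] = 0;
		if (regexec(&re, copy, 1, m, 0) == 0)
			pos = (long)(from + m[0].rm_so);
		free(copy);
	}
#endif
	regfree(&re);
	return pos;
}
```

With
.Dv REG_STARTEND ,
the reported offsets are still measured from the beginning of
.Fa string ,
which is why the fallback branch has to add
.Fa from
back in.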
.Pp
.Fn regerror
maps a non-zero
.Va errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is non-NULL,
the error code should have arisen from use of
the
.Vt regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Vt regex_t .
.Pf ( Fn regerror
may be able to supply a more detailed message using information
from the
.Vt regex_t . )
.Fn regerror
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including the terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
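.Pp
Because the return value is the required buffer size, a message of any
length can be retrieved with two calls.
The following sketch shows that pattern; the helper name
.Fn compile_error_text
is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of regex.h):
 * returns a malloc()ed copy of the regerror() message explaining why
 * an extended RE failed to compile, or NULL if it compiled (or if
 * memory ran out).  The caller frees the result.
 */
static char *
compile_error_text(const char *pattern)
{
	regex_t re;
	size_t need;
	char *buf;
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc == 0) {
		regfree(&re);
		return NULL;
	}
	/* First call: errbuf_size 0, so errbuf is ignored. */
	need = regerror(rc, &re, NULL, 0);
	buf = malloc(need);
	/* Second call: the buffer is now large enough for the message. */
	if (buf != NULL)
		regerror(rc, &re, buf, need);
	return buf;
}
```

A fixed-size buffer also works when truncation is acceptable, since
.Fn regerror
never writes more than
.Fa errbuf_size
bytes.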
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first OR'ed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.,
.Dq REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be non-null and the
.Fa re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
.Fn regfree
frees any dynamically allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Vt regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql \&[
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^
or
.Ql | .
.Pp
A
.Ql |
cannot appear first or last in a (sub)expression, or after another
.Ql | ,
i.e., an operand of
.Ql |
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql \&(\&) ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^
and
.Ql $
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -compact -width XREG_ECOLLATEX
.It Er REG_NOMATCH
.Fn regexec
failed to match
.It Er REG_BADPAT
invalid regular expression
.It Er REG_ECOLLATE
invalid collating element
.It Er REG_ECTYPE
invalid character class
.It Er REG_EESCAPE
\e applied to unescapable character
.It Er REG_ESUBREG
invalid backreference number
.It Er REG_EBRACK
brackets [ ] not balanced
.It Er REG_EPAREN
parentheses ( ) not balanced
.It Er REG_EBRACE
braces { } not balanced
.It Er REG_BADBR
invalid repetition count(s) in { }
.It Er REG_ERANGE
invalid character range in [ ]
.It Er REG_ESPACE
ran out of memory
.It Er REG_BADRPT
?, *, or + operand invalid
.It Er REG_EMPTY
empty (sub)expression
.It Er REG_ASSERT
.Dq can't happen
\(emyou found a bug
.It Er REG_INVARG
invalid argument, e.g., negative-length string
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Predecessors called
.Fn regcmp
and
.Fn regex
first appeared in PWB/UNIX 1.0.
.Pp
Predecessors
.Fn re_comp
and
.Fn re_exec
first appeared in
.Bx 4.0 ,
became part of
.In unistd.h
in
.Bx 4.4 ,
and were deleted after
.Ox 5.4 .
.Pp
Functions called
.Fn regcomp ,
.Fn regexec ,
.Fn regerror ,
and
.Fn regsub
first appeared in Version\~8
.At ,
were reimplemented and declared in
.In regexp.h
for
.Bx 4.3 Tahoe ,
and were also deleted after
.Ox 5.4 .
.Pp
Taking different arguments, the POSIX
.In regex.h
functions
.Fn regcomp ,
.Fn regexec ,
.Fn regerror ,
and
.Fn regfree
appeared in
.Bx 4.4 .
.Sh AUTHORS
.An -nosplit
The
Version\~8
.At
code was implemented by
.An Rob Pike
and extracted into a library by
.An Dave Presotto .
The
.Bx 4.3 Tahoe
and
.Bx 4.4
versions were both written by
.An Henry Spencer .
.Sh BUGS
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of
.St -p1003.2 ,
and only the collating elements etc. of that locale are available.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
.Fn regexec
performance is poor.
This will improve with later releases.
.Fa nmatch
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
.Fn regexec
is largely insensitive to RE complexity
.Em except
that back references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
.Fn regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql a)b
are legal REs because
.Ql \&)
is
a special character only in the presence of a previous unmatched
.Ql \&( .
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Dq a\e(\e(b\e)*\e2\e)*d
match
.Dq abbbd ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
746and bugs may lurk in combinations of word-boundary matching and anchoring.
747