.\" $NetBSD: regex.3,v 1.33 2022/12/04 01:29:32 uwe Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/regex.3 363817 2020-08-04 02:06:49Z kevans $
.\"
.Dd March 11, 2021
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree ,
.Nm regasub ,
.Nm regnsub
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Ft ssize_t
.Fn regnsub "char *buf" "size_t bufsiz" "const char *sub" "const regmatch_t *rm" "const char *str"
.Ft ssize_t
.Fn regasub "char **buf" "const char *sub" "const regmatch_t *rm" "const char *str"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the functions described here,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_GNU
Include GNU-inspired extensions:
.Pp
.Bl -tag -offset indent -width XX -compact
.It \eN
Use backreference
.Dv N
where
.Dv N
is a single-digit number between
.Dv 1
and
.Dv 9 .
.It \ea
Visual Bell
.It \eb
Match a position that is a word boundary.
.It \eB
Match a position that is not a word boundary.
.It \ef
Form Feed
.It \en
Line Feed
.It \er
Carriage Return
.It \es
Alias for [[:space:]]
.It \eS
Alias for [^[:space:]]
.It \et
Horizontal Tab
.It \ev
Vertical Tab
.It \ew
Alias for [[:alnum:]_]
.It \eW
Alias for [^[:alnum:]_]
.It \e'
Matches the end of the subject string (the string to be matched).
.It \e`
Matches the beginning of the subject string.
.El
.Pp
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx RETURN VALUES .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx RETURN VALUES .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
.Fn regerror
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Pp
The
.Fn regnsub
and
.Fn regasub
functions perform substitutions using
.Xr sed 1 Ns -like
syntax.
They return the length of the string that would have been created
had there been enough space, or
.Dv \-1
on error, setting
.Va errno .
The result is placed in
.Fa buf ,
which is user-supplied in
.Fn regnsub
and dynamically allocated in
.Fn regasub .
The
.Fa sub
argument contains a substitution string which may refer to the first
9 matched subexpressions using
.Dq \e<n>
to refer to the nth matched
item, or
.Dq &
(which is equivalent to
.Dq \e0 )
to refer to the full match.
The
.Fa rm
array must be at least 10 elements long, and should contain the result
of the matches from a previous
.Fn regexec
call.
Only 10 elements of the
.Fa rm
array can be used.
The
.Fa str
argument contains the source string to apply the transformation to.
630argument contains the source string to apply the transformation to.
631.Sh IMPLEMENTATION CHOICES
632There are a number of decisions that
633.St -p1003.2
634leaves up to the implementor,
635either by explicitly saying
636.Dq undefined
637or by virtue of them being
638forbidden by the RE grammar.
639This implementation treats them as follows.
640.Pp
641See
642.Xr re_format 7
643for a discussion of the definition of case-independent matching.
644.Pp
645There is no particular limit on the length of REs,
646except insofar as memory is limited.
647Memory usage is approximately linear in RE size, and largely insensitive
648to RE complexity, except for bounded repetitions.
649See
650.Sx BUGS
651for one short RE using them
652that will run almost any system out of memory.
653.Pp
654A backslashed character other than one specifically given a magic meaning
655by
656.St -p1003.2
657(such magic meanings occur only in obsolete
658.Bq Dq basic
659REs)
660is taken as an ordinary character.
661.Pp
662Any unmatched
663.Ql [\&
664is a
665.Dv REG_EBRACK
666error.
667.Pp
668Equivalence classes cannot begin or end bracket-expression ranges.
669The endpoint of one range cannot begin another.
670.Pp
671.Dv RE_DUP_MAX ,
672the limit on repetition counts in bounded repetitions, is 255.
673.Pp
674A repetition operator
675.Ql ( ?\& ,
676.Ql *\& ,
677.Ql +\& ,
678or bounds)
679cannot follow another
680repetition operator.
681A repetition operator cannot begin an expression or subexpression
682or follow
683.Ql ^\&
684or
685.Ql |\& .
686.Pp
687.Ql |\&
688cannot appear first or last in a (sub)expression or after another
689.Ql |\& ,
690i.e., an operand of
691.Ql |\&
692cannot be an empty subexpression.
693An empty parenthesized subexpression,
694.Ql "()" ,
695is legal and matches an
696empty (sub)string.
697An empty string is not a legal RE.
698.Pp
699A
700.Ql {\&
701followed by a digit is considered the beginning of bounds for a
702bounded repetition, which must then follow the syntax for bounds.
703A
704.Ql {\&
705.Em not
706followed by a digit is considered an ordinary character.
707.Pp
708.Ql ^\&
709and
710.Ql $\&
711beginning and ending subexpressions in obsolete
712.Pq Dq basic
713REs are anchors, not ordinary characters.
.Sh RETURN VALUES
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Pp
The
.Fn regnsub
and
.Fn regasub
functions appeared in
.Nx 8 .
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of
.Fn regexec
is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.
861Word-boundary matching does not work properly in multibyte locales.
862