xref: /netbsd-src/lib/libc/regex/regex.3 (revision f3cfa6f6ce31685c6c4a758bc430e69eb99f50a4)
1.\"	$NetBSD: regex.3,v 1.26 2018/02/26 14:04:43 kamil Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994
4.\"	The Regents of the University of California.  All rights reserved.
5.\"
6.\" This code is derived from software contributed to Berkeley by
7.\" Henry Spencer.
8.\"
9.\" Redistribution and use in source and binary forms, with or without
10.\" modification, are permitted provided that the following conditions
11.\" are met:
12.\" 1. Redistributions of source code must retain the above copyright
13.\"    notice, this list of conditions and the following disclaimer.
14.\" 2. Redistributions in binary form must reproduce the above copyright
15.\"    notice, this list of conditions and the following disclaimer in the
16.\"    documentation and/or other materials provided with the distribution.
17.\" 3. Neither the name of the University nor the names of its contributors
18.\"    may be used to endorse or promote products derived from this software
19.\"    without specific prior written permission.
20.\"
21.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31.\" SUCH DAMAGE.
32.\"
33.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
34.\"
35.\" This code is derived from software contributed to Berkeley by
36.\" Henry Spencer.
37.\"
38.\" Redistribution and use in source and binary forms, with or without
39.\" modification, are permitted provided that the following conditions
40.\" are met:
41.\" 1. Redistributions of source code must retain the above copyright
42.\"    notice, this list of conditions and the following disclaimer.
43.\" 2. Redistributions in binary form must reproduce the above copyright
44.\"    notice, this list of conditions and the following disclaimer in the
45.\"    documentation and/or other materials provided with the distribution.
46.\" 3. All advertising materials mentioning features or use of this software
47.\"    must display the following acknowledgement:
48.\"	This product includes software developed by the University of
49.\"	California, Berkeley and its contributors.
50.\" 4. Neither the name of the University nor the names of its contributors
51.\"    may be used to endorse or promote products derived from this software
52.\"    without specific prior written permission.
53.\"
54.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
55.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
56.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
57.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
58.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
59.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
60.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
61.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
62.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
63.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
64.\" SUCH DAMAGE.
65.\"
66.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
67.\"
68.Dd February 26, 2018
69.Dt REGEX 3
70.Os
71.Sh NAME
72.Nm regex ,
73.Nm regcomp ,
74.Nm regexec ,
75.Nm regerror ,
76.Nm regfree ,
77.Nm regasub ,
78.Nm regnsub
79.Nd regular-expression library
80.Sh LIBRARY
81.Lb libc
82.Sh SYNOPSIS
83.In regex.h
84.Ft int
85.Fn regcomp "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
86.Ft int
87.Fn regexec "const regex_t * restrict preg" "const char * restrict string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
88.Ft size_t
89.Fn regerror "int errcode" "const regex_t * restrict preg" "char * restrict errbuf" "size_t errbuf_size"
90.Ft void
91.Fn regfree "regex_t *preg"
92.Ft ssize_t
93.Fn regnsub "char *buf" "size_t bufsiz" "const char *sub" "const regmatch_t *rm" "const char *str"
94.Ft ssize_t
95.Fn regasub "char **buf" "const char *sub" "const regmatch_t *rm" "const char *sstr"
96.Sh DESCRIPTION
97These routines implement
98.St -p1003.2-92
99regular expressions (``RE''s);
100see
101.Xr re_format 7 .
102.Fn regcomp
103compiles an RE written as a string into an internal form,
104.Fn regexec
105matches that internal form against a string and reports results,
106.Fn regerror
107transforms error codes from either into human-readable messages,
108and
109.Fn regfree
110frees any dynamically-allocated storage used by the internal form
111of an RE.
112.Pp
113The header
114.In regex.h
115declares two structure types,
116.Fa regex_t
117and
118.Fa regmatch_t ,
119the former for compiled internal forms and the latter for match reporting.
120It also declares the four functions,
121a type
122.Fa regoff_t ,
123and a number of constants with names starting with ``REG_''.
124.Pp
125.Fn regcomp
126compiles the regular expression contained in the
127.Fa pattern
128string,
129subject to the flags in
130.Fa cflags ,
131and places the results in the
132.Fa regex_t
133structure pointed to by
134.Fa preg .
135.Fa cflags
136is the bitwise OR of zero or more of the following flags:
137.Bl -tag -width XXXREG_EXTENDED
138.It Dv REG_EXTENDED
139Compile modern (``extended'') REs, rather than the obsolete
140(``basic'') REs that are the default.
141.It Dv REG_BASIC
142This is a synonym for 0,
143provided as a counterpart to REG_EXTENDED to improve readability.
144.It Dv REG_NOSPEC
145Compile with recognition of all special characters turned off.
146All characters are thus considered ordinary, so the ``RE'' is a literal
147string.
148This is an extension, compatible with but not specified by
149.St -p1003.2-92 ,
150and should be used with caution in software intended to be portable to
151other systems.
152.Dv REG_EXTENDED
153and
154.Dv REG_NOSPEC
155may not be used in the same call to
156.Fn regcomp .
157.It Dv REG_ICASE
158Compile for matching that ignores upper/lower case distinctions.
159See
160.Xr re_format 7 .
161.It Dv REG_NOSUB
162Compile for matching that need only report success or failure, not
163what was matched.
164.It Dv REG_NEWLINE
165Compile for newline-sensitive matching.
166By default, newline is a completely ordinary character with no special
167meaning in either REs or strings.
168With this flag,
169`[^' bracket expressions and `.' never match newline,
170a `^' anchor matches the null string after any newline in the string
171in addition to its normal function,
172and the `$' anchor matches the null string before any newline in the
173string in addition to its normal function.
174.It Dv REG_PEND
175The regular expression ends, not at the first NUL, but just before the
176character pointed to by the
177.Fa re_endp
178member of the structure pointed to by
179.Fa preg .
180The
181.Fa re_endp
182member is of type
183.Fa "const\ char\ *" .
184This flag permits inclusion of NULs in the RE; they are considered
185ordinary characters.
186This is an extension, compatible with but not specified by
187.St -p1003.2-92 ,
188and should be used with caution in software intended to be portable to
189other systems.
190.El
191.Pp
192When successful,
193.Fn regcomp
194returns 0 and fills in the structure pointed to by
195.Fa preg .
196One member of that structure (other than
197.Fa re_endp )
198is publicized:
199.Fa re_nsub ,
200of type
201.Fa size_t ,
202contains the number of parenthesized subexpressions within the RE
203(except that the value of this member is undefined if the
204.Dv REG_NOSUB
205flag was used).
206If
207.Fn regcomp
208fails, it returns a non-zero error code;
209see
210.Sx DIAGNOSTICS .
211.Pp
212.Fn regexec
213matches the compiled RE pointed to by
214.Fa preg
215against the
216.Fa string ,
217subject to the flags in
218.Fa eflags ,
219and reports results using
220.Fa nmatch ,
221.Fa pmatch ,
222and the returned value.
223The RE must have been compiled by a previous invocation of
224.Fn regcomp .
225The compiled form is not altered during execution of
226.Fn regexec ,
227so a single compiled RE can be used simultaneously by multiple threads.
228.Pp
229By default,
230the NUL-terminated string pointed to by
231.Fa string
232is considered to be the text of an entire line, minus any terminating
233newline.
234The
235.Fa eflags
236argument is the bitwise OR of zero or more of the following flags:
237.Bl -tag -width XXXREG_NOTBOL
238.It Dv REG_NOTBOL
239The first character of the string
240is not the beginning of a line, so the `^' anchor should not match before it.
241This does not affect the behavior of newlines under
242.Dv REG_NEWLINE .
243.It Dv REG_NOTEOL
244The NUL terminating the string does not end a line, so the `$' anchor
245should not match before it.
246This does not affect the behavior of newlines under
247.Dv REG_NEWLINE .
248.It Dv REG_STARTEND
249The string is considered to start at
250.Fa string
251+
252.Fa pmatch[0].rm_so
253and to have a terminating NUL located at
254.Fa string
255+
256.Fa pmatch[0].rm_eo
257(there need not actually be a NUL at that location),
258regardless of the value of
259.Fa nmatch .
260See below for the definition of
261.Fa pmatch
262and
263.Fa nmatch .
264This is an extension, compatible with but not specified by
265.St -p1003.2-92 ,
266and should be used with caution in software intended to be portable to
267other systems.
268Note that a non-zero
269.Fa rm_so
270does not imply
271.Dv REG_NOTBOL ;
272.Dv REG_STARTEND
273affects only the location of the string, not how it is matched.
274.El
275.Pp
276See
277.Xr re_format 7
278for a discussion of what is matched in situations where an RE or a
279portion thereof could match any of several substrings of
280.Fa string .
281.Pp
282Normally,
283.Fn regexec
284returns 0 for success and the non-zero code
285.Dv REG_NOMATCH
286for failure.
287Other non-zero error codes may be returned in exceptional situations;
288see
289.Sx DIAGNOSTICS .
290.Pp
291If
292.Dv REG_NOSUB
293was specified in the compilation of the RE, or if
294.Fa nmatch
295is 0,
296.Fn regexec
297ignores the
298.Fa pmatch
299argument (but see below for the case where
300.Dv REG_STARTEND
301is specified).
302Otherwise,
303.Fa pmatch
304points to an array of
305.Fa nmatch
306structures of type
307.Fa regmatch_t .
308Such a structure has at least the members
309.Fa rm_so
310and
311.Fa rm_eo ,
312both of type
313.Fa regoff_t
314(a signed arithmetic type at least as large as an
315.Fa off_t
316and a
317.Fa ssize_t ) ,
318containing respectively the offset of the first character of a substring
319and the offset of the first character after the end of the substring.
320Offsets are measured from the beginning of the
321.Fa string
322argument given to
323.Fn regexec .
324An empty substring is denoted by equal offsets,
325both indicating the character following the empty substring.
326.Pp
327The 0th member of the
328.Fa pmatch
329array is filled in to indicate what substring of
330.Fa string
331was matched by the entire RE.
332Remaining members report what substring was matched by parenthesized
333subexpressions within the RE;
334member
335.Fa i
336reports subexpression
337.Fa i ,
338with subexpressions counted (starting at 1) by the order of their
339opening parentheses in the RE, left to right.
340Unused entries in the array\(emcorresponding either to subexpressions that
341did not participate in the match at all, or to subexpressions that do not
342exist in the RE (that is,
343.Fa i
344>
345.Fa preg->re_nsub )
346\(emhave both
347.Fa rm_so
348and
349.Fa rm_eo
350set to -1.
351If a subexpression participated in the match several times,
352the reported substring is the last one it matched.
353(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
354the parenthesized subexpression matches each of the three `b's and then
355an infinite number of empty strings following the last `b',
356so the reported substring is one of the empties.)
357.Pp
358If
359.Dv REG_STARTEND
360is specified,
361.Fa pmatch
362must point to at least one
363.Fa regmatch_t
364(even if
365.Fa nmatch
366is 0 or
367.Dv REG_NOSUB
368was specified),
369to hold the input offsets for
370.Dv REG_STARTEND .
371Use for output is still entirely controlled by
372.Fa nmatch ;
373if
374.Fa nmatch
375is 0 or
376.Dv REG_NOSUB
377was specified,
378the value of
379.Fa pmatch [0]
380will not be changed by a successful
381.Fn regexec .
382.Pp
383.Fn regerror
384maps a non-zero
385.Fa errcode
386from either
387.Fn regcomp
388or
389.Fn regexec
390to a human-readable, printable message.
391If
392.Fa preg
393is non-NULL,
394the error code should have arisen from use of the
395.Fa regex_t
396pointed to by
397.Fa preg ,
398and if the error code came from
399.Fn regcomp ,
400it should have been the result from the most recent
401.Fn regcomp
402using that
403.Fa regex_t .
404.Po Fn regerror
405may be able to supply a more detailed message using information
406from the
407.Fa regex_t . Pc
408.Fn regerror
409places the NUL-terminated message into the buffer pointed to by
410.Fa errbuf ,
411limiting the length (including the NUL) to at most
412.Fa errbuf_size
413bytes.
414If the whole message won't fit,
415as much of it as will fit before the terminating NUL is supplied.
416In any case,
417the returned value is the size of buffer needed to hold the whole
418message (including terminating NUL).
419If
420.Fa errbuf_size
421is 0,
422.Fa errbuf
423is ignored but the return value is still correct.
424.Pp
425If the
426.Fa errcode
427given to
428.Fn regerror
429is first ORed with
430.Dv REG_ITOA ,
431the ``message'' that results is the printable name of the error code,
432e.g. ``REG_NOMATCH'',
433rather than an explanation thereof.
434If
435.Fa errcode
436is
437.Dv REG_ATOI ,
438then
439.Fa preg
440shall be non-NULL and the
441.Fa re_endp
442member of the structure it points to
443must point to the printable name of an error code;
444in this case, the result in
445.Fa errbuf
446is the decimal digits of
447the numeric value of the error code
448(0 if the name is not recognized).
449.Dv REG_ITOA
450and
451.Dv REG_ATOI
452are intended primarily as debugging facilities;
453they are extensions, compatible with but not specified by
454.St -p1003.2-92 ,
455and should be used with caution in software intended to be portable to
456other systems.
457Be warned also that they are considered experimental and changes are possible.
458.Pp
459.Fn regfree
460frees any dynamically-allocated storage associated with the compiled RE
461pointed to by
462.Fa preg .
463The remaining
464.Fa regex_t
465is no longer a valid compiled RE
466and the effect of supplying it to
467.Fn regexec
468or
469.Fn regerror
470is undefined.
471.Pp
472None of these functions references global variables except for tables
473of constants;
474all are safe for use from multiple threads if the arguments are safe.
475.Pp
476The
477.Fn regnsub
478and
479.Fn regasub
480functions perform substitutions using
481.Xr sed 1
482like syntax.
483They return the length of the string that would have been created
484if there was enough space or
485.Dv \-1
486on error, setting
487.Dv errno .
488The result
489is being placed in
490.Fa buf
491which is user-supplied in
492.Fn regnsub
493and dynamically allocated in
494.Fn regasub .
495The
496.Fa sub
497argument contains a substitution string which might refer to the first
4989 regular expression strings using
499.Dq \e<n>
500to refer to the nth matched
501item, or
502.Dq &
503(which is equivalent to
504.Dq \e0 )
505to refer to the full match.
506The
507.Fa rm
508array must be at least 10 elements long, and should contain the result
509of the matches from a previous
510.Fn regexec
511call.
512Only 10 elements of the
513.Fa rm
514array can be used.
515The
516.Fa str
517argument contains the source string to apply the transformation to.
518.Sh IMPLEMENTATION CHOICES
519There are a number of decisions that
520.St -p1003.2-92
521leaves up to the implementor,
522either by explicitly saying ``undefined'' or by virtue of them being
523forbidden by the RE grammar.
524This implementation treats them as follows.
525.Pp
526See
527.Xr re_format 7
528for a discussion of the definition of case-independent matching.
529.Pp
530There is no particular limit on the length of REs,
531except insofar as memory is limited.
532Memory usage is approximately linear in RE size, and largely insensitive
533to RE complexity, except for bounded repetitions.
534See BUGS for one short RE using them
535that will run almost any system out of memory.
536.Pp
537A backslashed character other than one specifically given a magic meaning
538by
539.St -p1003.2-92
540(such magic meanings occur only in obsolete [``basic''] REs)
541is taken as an ordinary character.
542.Pp
543Any unmatched [ is a
544.Dv REG_EBRACK
545error.
546.Pp
547Equivalence classes cannot begin or end bracket-expression ranges.
548The endpoint of one range cannot begin another.
549.Pp
550.Dv RE_DUP_MAX ,
551the limit on repetition counts in bounded repetitions, is 255.
552.Pp
553A repetition operator (?, *, +, or bounds) cannot follow another
554repetition operator.
555A repetition operator cannot begin an expression or subexpression
556or follow `^' or `|'.
557.Pp
558`|' cannot appear first or last in a (sub)expression or after another `|',
559i.e. an operand of `|' cannot be an empty subexpression.
560An empty parenthesized subexpression, `()', is legal and matches an
561empty (sub)string.
562An empty string is not a legal RE.
563.Pp
564A `{' followed by a digit is considered the beginning of bounds for a
565bounded repetition, which must then follow the syntax for bounds.
566A `{'
567.Em not
568followed by a digit is considered an ordinary character.
569.Pp
570`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
571REs are anchors, not ordinary characters.
572.Sh DIAGNOSTICS
573Non-zero error codes from
574.Fn regcomp
575and
576.Fn regexec
577include the following:
578.Pp
579.Bl -tag -width XXXREG_ECOLLATE -compact
580.It Dv REG_NOMATCH
581.Fn regexec
582failed to match
583.It Dv REG_BADPAT
584invalid regular expression
585.It Dv REG_ECOLLATE
586invalid collating element
587.It Dv REG_ECTYPE
588invalid character class
589.It Dv REG_EESCAPE
590\e applied to unescapable character
591.It Dv REG_ESUBREG
592invalid backreference number
593.It Dv REG_EBRACK
594brackets [ ] not balanced
595.It Dv REG_EPAREN
596parentheses ( ) not balanced
597.It Dv REG_EBRACE
598braces { } not balanced
599.It Dv REG_BADBR
600invalid repetition count(s) in { }
601.It Dv REG_ERANGE
602invalid character range in [ ]
603.It Dv REG_ESPACE
604ran out of memory
605.It Dv REG_BADRPT
606?, *, or + operand invalid
607.It Dv REG_EMPTY
608empty (sub)expression
609.It Dv REG_ASSERT
610``can't happen''\(emyou found a bug
611.It Dv REG_INVARG
612invalid argument, e.g. negative-length string
613.El
614.Sh SEE ALSO
615.Xr grep 1 ,
616.Xr sed 1 ,
617.Xr re_format 7
618.Pp
619.St -p1003.2-92 ,
620sections 2.8 (Regular Expression Notation)
621and
622B.5 (C Binding for Regular Expression Matching).
623.Sh HISTORY
624Originally written by Henry Spencer.
625Altered for inclusion in the
626.Bx 4.4
627distribution.
628.Pp
629The
630.Fn regnsub
631and
632.Fn regasub
633functions appeared in
634.Nx 8 .
635.Sh BUGS
636There is one known functionality bug.
637The implementation of internationalization is incomplete:
638the locale is always assumed to be the default one of
639.St -p1003.2-92 ,
640and only the collating elements etc. of that locale are available.
641.Pp
642The back-reference code is subtle and doubts linger about its correctness
643in complex cases.
644.Pp
645.Fn regexec
646performance is poor.
647This will improve with later releases.
648.Fa nmatch
649exceeding 0 is expensive;
650.Fa nmatch
651exceeding 1 is worse.
652.Fa regexec
653is largely insensitive to RE complexity
654.Em except
655that back references are massively expensive.
656RE length does matter; in particular, there is a strong speed bonus
657for keeping RE length under about 30 characters,
658with most special characters counting roughly double.
659.Pp
660.Fn regcomp
661implements bounded repetitions by macro expansion,
662which is costly in time and space if counts are large
663or bounded repetitions are nested.
664An RE like, say,
665`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
666will (eventually) run almost any existing machine out of swap space.
667.Pp
668There are suspected problems with response to obscure error conditions.
669Notably,
670certain kinds of internal overflow,
671produced only by truly enormous REs or by multiply nested bounded repetitions,
672are probably not handled well.
673.Pp
674Due to a mistake in
675.St -p1003.2-92 ,
676things like `a)b' are legal REs because `)' is a special character
677only in the presence of a previous unmatched `('.
678This can't be fixed until the spec is fixed.
679.Pp
680The standard's definition of back references is vague.
681For example, does
682`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
683Until the standard is clarified, behavior in such cases should not be
684relied on.
685.Pp
686The implementation of word-boundary matching is a bit of a kludge,
687and bugs may lurk in combinations of word-boundary matching and anchoring.
688