xref: /netbsd-src/lib/libc/regex/regex.3 (revision 3feacbccbd4c8b6fd5f19c8e34474c4c3f8923d7)
1.\"	$NetBSD: regex.3,v 1.24 2016/01/14 22:06:42 christos Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994
4.\"	The Regents of the University of California.  All rights reserved.
5.\"
6.\" This code is derived from software contributed to Berkeley by
7.\" Henry Spencer.
8.\"
9.\" Redistribution and use in source and binary forms, with or without
10.\" modification, are permitted provided that the following conditions
11.\" are met:
12.\" 1. Redistributions of source code must retain the above copyright
13.\"    notice, this list of conditions and the following disclaimer.
14.\" 2. Redistributions in binary form must reproduce the above copyright
15.\"    notice, this list of conditions and the following disclaimer in the
16.\"    documentation and/or other materials provided with the distribution.
17.\" 3. Neither the name of the University nor the names of its contributors
18.\"    may be used to endorse or promote products derived from this software
19.\"    without specific prior written permission.
20.\"
21.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31.\" SUCH DAMAGE.
32.\"
33.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
34.\"
35.\" This code is derived from software contributed to Berkeley by
36.\" Henry Spencer.
37.\"
38.\" Redistribution and use in source and binary forms, with or without
39.\" modification, are permitted provided that the following conditions
40.\" are met:
41.\" 1. Redistributions of source code must retain the above copyright
42.\"    notice, this list of conditions and the following disclaimer.
43.\" 2. Redistributions in binary form must reproduce the above copyright
44.\"    notice, this list of conditions and the following disclaimer in the
45.\"    documentation and/or other materials provided with the distribution.
46.\" 3. All advertising materials mentioning features or use of this software
47.\"    must display the following acknowledgement:
48.\"	This product includes software developed by the University of
49.\"	California, Berkeley and its contributors.
50.\" 4. Neither the name of the University nor the names of its contributors
51.\"    may be used to endorse or promote products derived from this software
52.\"    without specific prior written permission.
53.\"
54.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
55.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
56.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
57.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
58.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
59.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
60.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
61.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
62.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
63.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
64.\" SUCH DAMAGE.
65.\"
66.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
67.\"
68.Dd January 8, 2016
69.Dt REGEX 3
70.Os
71.Sh NAME
72.Nm regex ,
73.Nm regcomp ,
74.Nm regexec ,
75.Nm regerror ,
76.Nm regfree ,
77.Nm regasub ,
78.Nm regnsub
79.Nd regular-expression library
80.Sh LIBRARY
81.Lb libc
82.Sh SYNOPSIS
83.In regex.h
84.Ft int
85.Fn regcomp "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
86.Ft int
87.Fn regexec "const regex_t * restrict preg" "const char * restrict string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
88.Ft size_t
89.Fn regerror "int errcode" "const regex_t * restrict preg" "char * restrict errbuf" "size_t errbuf_size"
90.Ft void
91.Fn regfree "regex_t *preg"
92.Ft ssize_t
93.Fn regnsub "char *buf" "size_t bufsiz" "const char *sub" "const regmatch_t *rm" "const char *str"
94.Ft ssize_t
95.Fn regasub "char **buf" "const char *sub" "const regmatch_t *rm" "const char *sstr"
96.Sh DESCRIPTION
97These routines implement
98.St -p1003.2-92
99regular expressions (``RE''s);
100see
101.Xr re_format 7 .
102.Fn regcomp
103compiles an RE written as a string into an internal form,
104.Fn regexec
105matches that internal form against a string and reports results,
106.Fn regerror
107transforms error codes from either into human-readable messages,
108and
109.Fn regfree
110frees any dynamically-allocated storage used by the internal form
111of an RE.
112.Pp
113The header
114.In regex.h
115declares two structure types,
116.Fa regex_t
117and
118.Fa regmatch_t ,
119the former for compiled internal forms and the latter for match reporting.
120It also declares the four functions,
121a type
122.Fa regoff_t ,
123and a number of constants with names starting with ``REG_''.
124.Pp
125.Fn regcomp
126compiles the regular expression contained in the
127.Fa pattern
128string,
129subject to the flags in
130.Fa cflags ,
131and places the results in the
132.Fa regex_t
133structure pointed to by
134.Fa preg .
135.Fa cflags
136is the bitwise OR of zero or more of the following flags:
137.Bl -tag -width XXXREG_EXTENDED
138.It Dv REG_EXTENDED
139Compile modern (``extended'') REs, rather than the obsolete
140(``basic'') REs that are the default.
141.It Dv REG_BASIC
142This is a synonym for 0,
143provided as a counterpart to REG_EXTENDED to improve readability.
144.It Dv REG_NOSPEC
145Compile with recognition of all special characters turned off.
146All characters are thus considered ordinary, so the ``RE'' is a literal
147string.
148This is an extension, compatible with but not specified by
149.St -p1003.2-92 ,
150and should be used with caution in software intended to be portable to
151other systems.
152.Dv REG_EXTENDED
153and
154.Dv REG_NOSPEC
155may not be used in the same call to
156.Fn regcomp .
157.It Dv REG_ICASE
158Compile for matching that ignores upper/lower case distinctions.
159See
160.Xr re_format 7 .
161.It Dv REG_NOSUB
162Compile for matching that need only report success or failure, not
163what was matched.
164.It Dv REG_NEWLINE
165Compile for newline-sensitive matching.
166By default, newline is a completely ordinary character with no special
167meaning in either REs or strings.
168With this flag,
169`[^' bracket expressions and `.' never match newline,
170a `^' anchor matches the null string after any newline in the string
171in addition to its normal function,
172and the `$' anchor matches the null string before any newline in the
173string in addition to its normal function.
174.It Dv REG_PEND
175The regular expression ends, not at the first NUL, but just before the
176character pointed to by the
177.Fa re_endp
178member of the structure pointed to by
179.Fa preg .
180The
181.Fa re_endp
182member is of type
183.Fa "const\ char\ *" .
184This flag permits inclusion of NULs in the RE; they are considered
185ordinary characters.
186This is an extension, compatible with but not specified by
187.St -p1003.2-92 ,
188and should be used with caution in software intended to be portable to
189other systems.
190.El
191.Pp
192When successful,
193.Fn regcomp
194returns 0 and fills in the structure pointed to by
195.Fa preg .
196One member of that structure (other than
197.Fa re_endp )
198is publicized:
199.Fa re_nsub ,
200of type
201.Fa size_t ,
202contains the number of parenthesized subexpressions within the RE
203(except that the value of this member is undefined if the
204.Dv REG_NOSUB
205flag was used).
206If
207.Fn regcomp
208fails, it returns a non-zero error code;
209see
210.Sx DIAGNOSTICS .
211.Pp
212.Fn regexec
213matches the compiled RE pointed to by
214.Fa preg
215against the
216.Fa string ,
217subject to the flags in
218.Fa eflags ,
219and reports results using
220.Fa nmatch ,
221.Fa pmatch ,
222and the returned value.
223The RE must have been compiled by a previous invocation of
224.Fn regcomp .
225The compiled form is not altered during execution of
226.Fn regexec ,
227so a single compiled RE can be used simultaneously by multiple threads.
228.Pp
229By default,
230the NUL-terminated string pointed to by
231.Fa string
232is considered to be the text of an entire line, minus any terminating
233newline.
234The
235.Fa eflags
236argument is the bitwise OR of zero or more of the following flags:
237.Bl -tag -width XXXREG_NOTBOL
238.It Dv REG_NOTBOL
239The first character of the string
240is not the beginning of a line, so the `^' anchor should not match before it.
241This does not affect the behavior of newlines under
242.Dv REG_NEWLINE .
243.It Dv REG_NOTEOL
244The NUL terminating the string does not end a line, so the `$' anchor
245should not match before it.
246This does not affect the behavior of newlines under
247.Dv REG_NEWLINE .
248.It Dv REG_STARTEND
249The string is considered to start at
250.Fa string
251+
252.Fa pmatch[0].rm_so
253and to have a terminating NUL located at
254.Fa string
255+
256.Fa pmatch[0].rm_eo
257(there need not actually be a NUL at that location),
258regardless of the value of
259.Fa nmatch .
260See below for the definition of
261.Fa pmatch
262and
263.Fa nmatch .
264This is an extension, compatible with but not specified by
265.St -p1003.2-92 ,
266and should be used with caution in software intended to be portable to
267other systems.
268Note that a non-zero
269.Fa rm_so
270does not imply
271.Dv REG_NOTBOL ;
272.Dv REG_STARTEND
273affects only the location of the string, not how it is matched.
274.El
275.Pp
276See
277.Xr re_format 7
278for a discussion of what is matched in situations where an RE or a
279portion thereof could match any of several substrings of
280.Fa string .
281.Pp
282Normally,
283.Fn regexec
284returns 0 for success and the non-zero code
285.Dv REG_NOMATCH
286for failure.
287Other non-zero error codes may be returned in exceptional situations;
288see
289.Sx DIAGNOSTICS .
290.Pp
291If
292.Dv REG_NOSUB
293was specified in the compilation of the RE, or if
294.Fa nmatch
295is 0,
296.Fn regexec
297ignores the
298.Fa pmatch
299argument (but see below for the case where
300.Dv REG_STARTEND
301is specified).
302Otherwise,
303.Fa pmatch
304points to an array of
305.Fa nmatch
306structures of type
307.Fa regmatch_t .
308Such a structure has at least the members
309.Fa rm_so
310and
311.Fa rm_eo ,
312both of type
313.Fa regoff_t
314(a signed arithmetic type at least as large as an
315.Fa off_t
316and a
317.Fa ssize_t ) ,
318containing respectively the offset of the first character of a substring
319and the offset of the first character after the end of the substring.
320Offsets are measured from the beginning of the
321.Fa string
322argument given to
323.Fn regexec .
324An empty substring is denoted by equal offsets,
325both indicating the character following the empty substring.
326.Pp
327The 0th member of the
328.Fa pmatch
329array is filled in to indicate what substring of
330.Fa string
331was matched by the entire RE.
332Remaining members report what substring was matched by parenthesized
333subexpressions within the RE;
334member
335.Fa i
336reports subexpression
337.Fa i ,
338with subexpressions counted (starting at 1) by the order of their
339opening parentheses in the RE, left to right.
340Unused entries in the array\(emcorresponding either to subexpressions that
341did not participate in the match at all, or to subexpressions that do not
342exist in the RE (that is,
343.Fa i
344\*[Gt]
345.Fa preg-\*[Gt]re_nsub )
346\(emhave both
347.Fa rm_so
348and
349.Fa rm_eo
350set to -1.
351If a subexpression participated in the match several times,
352the reported substring is the last one it matched.
353(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
354the parenthesized subexpression matches each of the three `b's and then
355an infinite number of empty strings following the last `b',
356so the reported substring is one of the empties.)
357.Pp
358If
359.Dv REG_STARTEND
360is specified,
361.Fa pmatch
362must point to at least one
363.Fa regmatch_t
364(even if
365.Fa nmatch
366is 0 or
367.Dv REG_NOSUB
368was specified),
369to hold the input offsets for
370.Dv REG_STARTEND .
371Use for output is still entirely controlled by
372.Fa nmatch ;
373if
374.Fa nmatch
375is 0 or
376.Dv REG_NOSUB
377was specified,
378the value of
379.Fa pmatch [0]
380will not be changed by a successful
381.Fn regexec .
382.Pp
383.Fn regerror
384maps a non-zero
385.Fa errcode
386from either
387.Fn regcomp
388or
389.Fn regexec
390to a human-readable, printable message.
391If
392.Fa preg
393is non-NULL,
394the error code should have arisen from use of the
395.Fa regex_t
396pointed to by
397.Fa preg ,
398and if the error code came from
399.Fn regcomp ,
400it should have been the result from the most recent
401.Fn regcomp
402using that
403.Fa regex_t .
404.Po Fn regerror
405may be able to supply a more detailed message using information
406from the
407.Fa regex_t . Pc
408.Fn regerror
409places the NUL-terminated message into the buffer pointed to by
410.Fa errbuf ,
411limiting the length (including the NUL) to at most
412.Fa errbuf_size
413bytes.
414If the whole message won't fit,
415as much of it as will fit before the terminating NUL is supplied.
416In any case,
417the returned value is the size of buffer needed to hold the whole
418message (including terminating NUL).
419If
420.Fa errbuf_size
421is 0,
422.Fa errbuf
423is ignored but the return value is still correct.
424.Pp
425If the
426.Fa errcode
427given to
428.Fn regerror
429is first ORed with
430.Dv REG_ITOA ,
431the ``message'' that results is the printable name of the error code,
432e.g. ``REG_NOMATCH'',
433rather than an explanation thereof.
434If
435.Fa errcode
436is
437.Dv REG_ATOI ,
438then
439.Fa preg
440shall be non-NULL and the
441.Fa re_endp
442member of the structure it points to
443must point to the printable name of an error code;
444in this case, the result in
445.Fa errbuf
446is the decimal digits of
447the numeric value of the error code
448(0 if the name is not recognized).
449.Dv REG_ITOA
450and
451.Dv REG_ATOI
452are intended primarily as debugging facilities;
453they are extensions, compatible with but not specified by
454.St -p1003.2-92 ,
455and should be used with caution in software intended to be portable to
456other systems.
457Be warned also that they are considered experimental and changes are possible.
458.Pp
459.Fn regfree
460frees any dynamically-allocated storage associated with the compiled RE
461pointed to by
462.Fa preg .
463The remaining
464.Fa regex_t
465is no longer a valid compiled RE
466and the effect of supplying it to
467.Fn regexec
468or
469.Fn regerror
470is undefined.
471.Pp
472None of these functions references global variables except for tables
473of constants;
474all are safe for use from multiple threads if the arguments are safe.
475.Pp
476The
477.Fn regnsub
478and
479.Fn regasub
480functions perform substitutions using
481.Xr sed 1
482like syntax.
483They return the length of the string that would have been created
484if there was enough space or
485.Dv \-1
486on error, setting
487.Dv errno .
488The result
489is being placed in
490.Fa buf
491which is user-supplied in
492.Fn regnsub
493and dynamically allocated in
494.Fn regasub .
495The
496.Fa sub
497argument contains a substitution string which might refer to the first
4989 regular expression strings using
499.Dq \e<n>
500to refer to the nth matched
501item, or
502.Dq &
503(which is equivalent to
504.Dq \e0 )
505to refer to the full match.
506The
507.Fa rm
508array must be at least 10 elements long, and should contain the result
509of the matches from a previous
510.Fn regexec
511call.
512The
513.Fa str
514argument contains the source string to apply the transformation to.
515.Sh IMPLEMENTATION CHOICES
516There are a number of decisions that
517.St -p1003.2-92
518leaves up to the implementor,
519either by explicitly saying ``undefined'' or by virtue of them being
520forbidden by the RE grammar.
521This implementation treats them as follows.
522.Pp
523See
524.Xr re_format 7
525for a discussion of the definition of case-independent matching.
526.Pp
527There is no particular limit on the length of REs,
528except insofar as memory is limited.
529Memory usage is approximately linear in RE size, and largely insensitive
530to RE complexity, except for bounded repetitions.
531See BUGS for one short RE using them
532that will run almost any system out of memory.
533.Pp
534A backslashed character other than one specifically given a magic meaning
535by
536.St -p1003.2-92
537(such magic meanings occur only in obsolete [``basic''] REs)
538is taken as an ordinary character.
539.Pp
540Any unmatched [ is a
541.Dv REG_EBRACK
542error.
543.Pp
544Equivalence classes cannot begin or end bracket-expression ranges.
545The endpoint of one range cannot begin another.
546.Pp
547.Dv RE_DUP_MAX ,
548the limit on repetition counts in bounded repetitions, is 255.
549.Pp
550A repetition operator (?, *, +, or bounds) cannot follow another
551repetition operator.
552A repetition operator cannot begin an expression or subexpression
553or follow `^' or `|'.
554.Pp
555`|' cannot appear first or last in a (sub)expression or after another `|',
556i.e. an operand of `|' cannot be an empty subexpression.
557An empty parenthesized subexpression, `()', is legal and matches an
558empty (sub)string.
559An empty string is not a legal RE.
560.Pp
561A `{' followed by a digit is considered the beginning of bounds for a
562bounded repetition, which must then follow the syntax for bounds.
563A `{'
564.Em not
565followed by a digit is considered an ordinary character.
566.Pp
567`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
568REs are anchors, not ordinary characters.
569.Sh DIAGNOSTICS
570Non-zero error codes from
571.Fn regcomp
572and
573.Fn regexec
574include the following:
575.Pp
576.Bl -tag -width XXXREG_ECOLLATE -compact
577.It Dv REG_NOMATCH
578.Fn regexec
579failed to match
580.It Dv REG_BADPAT
581invalid regular expression
582.It Dv REG_ECOLLATE
583invalid collating element
584.It Dv REG_ECTYPE
585invalid character class
586.It Dv REG_EESCAPE
587\e applied to unescapable character
588.It Dv REG_ESUBREG
589invalid backreference number
590.It Dv REG_EBRACK
591brackets [ ] not balanced
592.It Dv REG_EPAREN
593parentheses ( ) not balanced
594.It Dv REG_EBRACE
595braces { } not balanced
596.It Dv REG_BADBR
597invalid repetition count(s) in { }
598.It Dv REG_ERANGE
599invalid character range in [ ]
600.It Dv REG_ESPACE
601ran out of memory
602.It Dv REG_BADRPT
603?, *, or + operand invalid
604.It Dv REG_EMPTY
605empty (sub)expression
606.It Dv REG_ASSERT
607``can't happen''\(emyou found a bug
608.It Dv REG_INVARG
609invalid argument, e.g. negative-length string
610.El
611.Sh SEE ALSO
612.Xr grep 1 ,
613.Xr sed 1 ,
614.Xr re_format 7
615.Pp
616.St -p1003.2-92 ,
617sections 2.8 (Regular Expression Notation)
618and
619B.5 (C Binding for Regular Expression Matching).
620.Sh HISTORY
621Originally written by Henry Spencer.
622Altered for inclusion in the
623.Bx 4.4
624distribution.
625.Pp
626The
627.Fn regnsub
628and
629.Fn regasub
630functions appeared in
631.Nx 8 .
632.Sh BUGS
633There is one known functionality bug.
634The implementation of internationalization is incomplete:
635the locale is always assumed to be the default one of
636.St -p1003.2-92 ,
637and only the collating elements etc. of that locale are available.
638.Pp
639The back-reference code is subtle and doubts linger about its correctness
640in complex cases.
641.Pp
642.Fn regexec
643performance is poor.
644This will improve with later releases.
645.Fa nmatch
646exceeding 0 is expensive;
647.Fa nmatch
648exceeding 1 is worse.
649.Fa regexec
650is largely insensitive to RE complexity
651.Em except
652that back references are massively expensive.
653RE length does matter; in particular, there is a strong speed bonus
654for keeping RE length under about 30 characters,
655with most special characters counting roughly double.
656.Pp
657.Fn regcomp
658implements bounded repetitions by macro expansion,
659which is costly in time and space if counts are large
660or bounded repetitions are nested.
661An RE like, say,
662`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
663will (eventually) run almost any existing machine out of swap space.
664.Pp
665There are suspected problems with response to obscure error conditions.
666Notably,
667certain kinds of internal overflow,
668produced only by truly enormous REs or by multiply nested bounded repetitions,
669are probably not handled well.
670.Pp
671Due to a mistake in
672.St -p1003.2-92 ,
673things like `a)b' are legal REs because `)' is a special character
674only in the presence of a previous unmatched `('.
675This can't be fixed until the spec is fixed.
676.Pp
677The standard's definition of back references is vague.
678For example, does
679`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
680Until the standard is clarified, behavior in such cases should not be
681relied on.
682.Pp
683The implementation of word-boundary matching is a bit of a kludge,
684and bugs may lurk in combinations of word-boundary matching and anchoring.
685