.\"	$OpenBSD: re_format.7,v 1.23 2021/07/07 11:21:55 martijn Exp $
.\"
.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
.\"
.Dd $Mdocdate: July 7 2021 $
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX regular expressions
.Sh DESCRIPTION
Regular expressions (REs),
as defined in
.St -p1003.1-2004 ,
come in two forms:
basic regular expressions
(BREs)
and extended regular expressions
(EREs).
Both forms of regular expressions are supported
by the interfaces described in
.Xr regex 3 .
Applications dealing with regular expressions
may use one or the other form
(or indeed both).
For example,
.Xr ed 1
uses BREs,
whilst
.Xr egrep 1
talks EREs.
Consult the manual page for the specific application to find out which
it uses.
.Pp
POSIX leaves some aspects of RE syntax and semantics open;
.Sq **
marks decisions on these aspects that
may not be fully portable to other POSIX implementations.
.Pp
This manual page first describes regular expressions in general,
specifically extended regular expressions,
and then discusses differences between them and basic regular expressions.
.Sh EXTENDED REGULAR EXPRESSIONS
An ERE is one** or more non-empty**
.Em branches ,
separated by
.Sq | .
It matches anything that matches one of the branches.
.Pp
A branch is one** or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed by a single**
.Sq * ,
.Sq + ,
.Sq ?\& ,
or
.Em bound .
An atom followed by
.Sq *
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Sq +
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Sq ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A bound is
.Sq {
followed by an unsigned decimal integer,
possibly followed by
.Sq ,\&
possibly followed by another unsigned decimal integer,
always followed by
.Sq } .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255**) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Ar i
and no comma matches
a sequence of exactly
.Ar i
matches of the atom.
An atom followed by a bound
containing one integer
.Ar i
and a comma matches
a sequence of
.Ar i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Ar i
and
.Ar j
matches a sequence of
.Ar i
through
.Ar j
(inclusive) matches of the atom.
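.Pp
As an illustration of bounds, the following C sketch compiles an ERE with
the interfaces described in
.Xr regex 3
and reports whether it matches.
The helper name
.Fn ere_matches
is hypothetical and exists only for this example.
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Returns 1 if the ERE `pattern` matches somewhere in `text`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int
ere_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	/* REG_NOSUB: only a yes/no answer is wanted, no offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```
Under these assumptions,
.Sq ^a{2}$
matches
.Sq aa
but not
.Sq aaa ,
.Sq ^a{2,}$
matches
.Sq aaaa ,
and
.Sq ^a{3,1}$
is rejected at compile time because the first integer exceeds the second.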
.Pp
An atom is a regular expression enclosed in
.Sq ()
(matching a part of the regular expression),
an empty set of
.Sq ()
(matching the null string)**,
a
.Em bracket expression
(see below),
.Sq .\&
(matching any single character),
.Sq ^
(matching the null string at the beginning of a line),
.Sq $
(matching the null string at the end of a line),
a
.Sq \e
followed by one of the characters
.Sq ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Sq \e
followed by any other character**
(matching that character taken as an ordinary character,
as if the
.Sq \e
had not been present**),
or a single character with no other significance (matching that character).
A
.Sq {
followed by a character other than a digit is an ordinary character,
not the beginning of a bound**.
It is illegal to end an RE with
.Sq \e .
.Pp
A bracket expression is a list of characters enclosed in
.Sq [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Sq ^ ,
it matches any single character
.Em not
from the rest of the list
(but see below).
If two characters in the list are separated by
.Sq - ,
this is shorthand for the full
.Em range
of characters between those two (inclusive) in the
collating sequence, e.g.\&
.Sq [0-9]
in ASCII matches any decimal digit.
It is illegal** for two ranges to share an endpoint, e.g.\&
.Sq a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Sq ]\&
in the list, make it the first character
(following a possible
.Sq ^ ) .
To include a literal
.Sq - ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Sq -
as the first endpoint of a range,
enclose it in
.Sq [.
and
.Sq .]
to make it a collating element (see below).
With the exception of these and some combinations using
.Sq \&[
(see next paragraphs),
all other special characters, including
.Sq \e ,
lose their special significance within a bracket expression.
.Pp
Within a bracket expression, a collating element
(a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Sq [.
and
.Sq .]
stands for the sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g. if the collating sequence includes a
.Sq ch
collating element,
then the RE
.Sq [[.ch.]]*c
matches the first five characters of
.Sq chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Sq [=
and
.Sq =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Sq [.
and
.Sq .] . )
For example, if
.Sq x
and
.Sq y
are the members of an equivalence class,
then
.Sq [[=x=]] ,
.Sq [[=y=]] ,
and
.Sq [xy]
are all synonymous.
An equivalence class may not** be an endpoint of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed
in
.Sq [:
and
.Sq :]
stands for the list of all characters belonging to that class.
Standard character class names are:
.Bd -literal -offset indent
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.Ed
.Pp
These stand for the character classes defined in
.Xr isalnum 3 ,
.Xr isalpha 3 ,
and so on.
A character class may not be used as an endpoint of a range.
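.Pp
The bracket expression rules above can be tried out with a small C sketch
built on
.Xr regex 3 .
The helper
.Fn ere_find
is a hypothetical name chosen for this example, and the offsets mentioned
afterwards assume the default C locale.
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Finds the leftmost match of the ERE `pattern` in `text` and stores
 * its byte offsets in *start (inclusive) and *end (exclusive).
 * Returns 1 on a match, 0 on no match, -1 if the pattern is invalid.
 */
int
ere_find(const char *pattern, const char *text, int *start, int *end)
{
	regex_t re;
	regmatch_t pm[1];
	int rc;

	assert(start != NULL && end != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, 1, pm, 0);
	regfree(&re);
	if (rc != 0)
		return 0;
	*start = (int)pm[0].rm_so;
	*end = (int)pm[0].rm_eo;
	return 1;
}
```
With this helper,
.Sq [[:digit:]]+
applied to
.Sq abc 2021-07
yields offsets 4 and 8, and
.Sq []a-]+ ,
which lists a literal
.Sq ]\&
first and a literal
.Sq -
last, yields offsets 1 and 4 when applied to
.Sq x]-ay .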
.Pp
There are two special cases** of bracket expressions:
the bracket expressions
.Sq [[:<:]]
and
.Sq [[:>:]]
match the null string at the beginning and end of a word, respectively.
A word is defined as a sequence of
characters starting and ending with a word character
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr isalnum 3 )
or an underscore.
This is an extension,
compatible with but not specified by POSIX,
and should be used with
caution in software intended to be portable to other systems.
The additional word delimiters
.Ql \e<
and
.Ql \e>
are provided to ease compatibility with traditional SVR4
systems but are not portable and should be avoided.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Sq bb*
matches the three middle characters of
.Sq abbbc ;
.Sq (wee|week)(knights|nights)
matches all ten characters of
.Sq weeknights ;
when
.Sq (.*).*
is matched against
.Sq abc ,
the parenthesized subexpression matches all three characters;
and when
.Sq (a*)*
is matched against
.Sq bc ,
both the whole RE and the parenthesized subexpression match the null string.
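.Pp
The leftmost-longest examples above can be checked with a short C sketch
that reports the offsets recorded by
.Xr regex 3
for the whole match (group 0) or for a parenthesized subexpression.
The helper
.Fn ere_group_span
and its limit of ten groups are assumptions made for this illustration.
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Matches the ERE `pattern` against `text` and stores the offsets of
 * subexpression `group` (0 is the whole match) in *start and *end.
 * A group that did not take part in the match is reported as -1, -1.
 * Returns 1 on a match, 0 on no match, -1 if the pattern is invalid.
 */
int
ere_group_span(const char *pattern, const char *text, size_t group,
    int *start, int *end)
{
	regex_t re;
	regmatch_t pm[10];
	int rc;

	assert(group < 10);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, 10, pm, 0);
	regfree(&re);
	if (rc != 0)
		return 0;
	*start = (int)pm[group].rm_so;
	*end = (int)pm[group].rm_eo;
	return 1;
}
```
For instance, group 0 of
.Sq bb*
against
.Sq abbbc
spans offsets 1 to 4, and group 1 of
.Sq (.*).*
against
.Sq abc
spans offsets 0 to 3.
Offsets reported for subexpressions inside repeated groups, such as
.Sq (a*)* ,
are known to vary between implementations, so the sketch is best used
to inspect behaviour rather than to rely on it.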
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g.\&
.Sq x
becomes
.Sq [xX] .
When it appears inside a bracket expression,
all case counterparts of it are added to the bracket expression,
so that, for example,
.Sq [x]
becomes
.Sq [xX]
and
.Sq [^x]
becomes
.Sq [^xX] .
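.Pp
A minimal C sketch of case-independent matching, assuming it is requested
through the
.Dv REG_ICASE
flag of
.Xr regex 3 ,
follows.
The helper
.Fn ere_matches_flags
is a hypothetical name used only here.
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Compiles `pattern` as an ERE with the extra compile flags in
 * `cflags` (for example REG_ICASE) and tests it against `text`.
 * Returns 1 on a match, 0 on no match, -1 if the pattern is invalid.
 */
int
ere_matches_flags(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```
With
.Dv REG_ICASE ,
.Sq ^x$
matches
.Sq X ,
while
.Sq ^[^x]$
no longer does, since the negated list then excludes both cases.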
.Pp
No particular limit is imposed on the length of REs**.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Pp
The following is a list of extended regular expressions:
.Bl -tag -width Ds
.It Ar c
Any character
.Ar c
not listed below matches itself.
.It \e Ns Ar c
Any backslash-escaped character
.Ar c
matches itself.
.It \&.
Matches any single character that is not a newline
.Pq Sq \en .
.It Bq Ar char-class
Matches any single character in
.Ar char-class .
To include a
.Ql \&]
in
.Ar char-class ,
it must be the first character.
A range of characters may be specified by separating the end characters
of the range with a
.Ql - ;
e.g.\&
.Ar a-z
specifies the lower case characters.
The following literal expressions can also be used in
.Ar char-class
to specify sets of characters:
.Bd -unfilled -offset indent
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
.Ed
.Pp
If
.Ql -
appears as the first or last character of
.Ar char-class ,
then it matches itself.
All other characters in
.Ar char-class
match themselves.
.Pp
Patterns in
.Ar char-class
of the form
.Eo [.
.Ar col-elm
.Ec .]\&
or
.Eo [=
.Ar col-elm
.Ec =]\& ,
where
.Ar col-elm
is a collating element, are interpreted according to
.Xr setlocale 3
.Pq not currently supported .
.It Bq ^ Ns Ar char-class
Matches any single character, other than newline, not in
.Ar char-class .
.Ar char-class
is defined as above.
.It ^
If
.Sq ^
is the first character of a regular expression, then it
anchors the regular expression to the beginning of a line.
Otherwise, it matches itself.
.It $
If
.Sq $
is the last character of a regular expression,
it anchors the regular expression to the end of a line.
Otherwise, it matches itself.
.It [[:<:]]
Anchors the single character regular expression or subexpression
immediately following it to the beginning of a word.
.It [[:>:]]
Anchors the single character regular expression or subexpression
immediately preceding it to the end of a word.
.It Pq Ar re
Defines a subexpression
.Ar re .
Any set of characters enclosed in parentheses
matches whatever the set of characters without parentheses matches
(that is a long-winded way of saying the constructs
.Sq (re)
and
.Sq re
match identically).
.It *
Matches the single character regular expression or subexpression
immediately preceding it zero or more times.
If
.Sq *
is the first character of a regular expression or subexpression,
then it matches itself.
The
.Sq *
operator sometimes yields unexpected results.
For example, the regular expression
.Ar b*
matches the beginning of the string
.Qq abbb
(as opposed to the substring
.Qq bbb ) ,
since a null match is the only leftmost match.
.It +
Matches the single character regular expression
or subexpression immediately preceding it
one or more times.
.It ?
Matches the single character regular expression
or subexpression immediately preceding it
zero or one times.
.Sm off
.It Xo
.Pf { Ar n , m No }\ \&
.Pf { Ar n , No }\ \&
.Pf { Ar n No }
.Xc
.Sm on
Matches the single character regular expression or subexpression
immediately preceding it at least
.Ar n
and at most
.Ar m
times.
If
.Ar m
is omitted, then it matches at least
.Ar n
times.
If the comma is also omitted, then it matches exactly
.Ar n
times.
.It |
Used to separate patterns.
For example,
the pattern
.Sq cat|dog
matches either
.Sq cat
or
.Sq dog .
.El
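.Pp
The null-match behaviour of
.Sq *
and the alternation operator from the list above can be observed with
the following C sketch.
The helper
.Fn ere_span
is a hypothetical name introduced for this example.
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Stores the byte offsets of the leftmost match of the ERE `pattern`
 * in `text` in *start and *end; an empty match has *start == *end.
 * Returns 1 on a match, 0 on no match, -1 if the pattern is invalid.
 */
int
ere_span(const char *pattern, const char *text, int *start, int *end)
{
	regex_t re;
	regmatch_t pm[1];
	int rc;

	assert(start != NULL && end != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, 1, pm, 0);
	regfree(&re);
	if (rc != 0)
		return 0;
	*start = (int)pm[0].rm_so;
	*end = (int)pm[0].rm_eo;
	return 1;
}
```
Applied to
.Qq abbb ,
.Sq b*
reports the empty span from offset 0 to offset 0, whereas
.Sq b+
reports offsets 1 to 4.
Applied to
.Qq hotdog ,
.Sq cat|dog
reports offsets 3 to 6.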
.Sh BASIC REGULAR EXPRESSIONS
Basic regular expressions differ in several respects:
.Bl -bullet -offset 3n
.It
The delimiters for bounds are
.Sq \e{
and
.Sq \e} ,
with
.Sq {
and
.Sq }
by themselves ordinary characters.
.It
.Sq | ,
.Sq + ,
and
.Sq ?\&
are ordinary characters.
.Sq \e{1,\e}
is equivalent to
.Sq + .
.Sq \e{0,1\e}
is equivalent to
.Sq ?\& .
There is no equivalent for
.Sq | .
.It
The parentheses for nested subexpressions are
.Sq \e(
and
.Sq \e) ,
with
.Sq \&(
and
.Sq )\&
by themselves ordinary characters.
.It
.Sq ^
is an ordinary character except at the beginning of the
RE or** the beginning of a parenthesized subexpression.
.It
.Sq $
is an ordinary character except at the end of the
RE or** the end of a parenthesized subexpression.
.It
.Sq *
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Sq ^ ) .
.It
Finally, there is one new type of atom, a
.Em back-reference :
.Sq \e
followed by a non-zero decimal digit
.Ar d
matches the same sequence of characters matched by the
.Ar d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that, for example,
.Sq \e([bc]\e)\e1
matches
.Sq bb\&
or
.Sq cc
but not
.Sq bc .
.El
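.Pp
The differences listed above can be exercised with a small C sketch that
compiles a pattern as a BRE, which is assumed here to be what
.Xr regex 3
does when
.Dv REG_EXTENDED
is not given.
The helper
.Fn bre_matches
is a hypothetical name used only for this illustration.
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Compiles `pattern` as a basic regular expression (no REG_EXTENDED)
 * and tests it against `text`.
 * Returns 1 on a match, 0 on no match, -1 if the pattern is invalid.
 */
int
bre_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, 0) != 0)
		return -1;
	/* nmatch is 0, so no offsets are requested. */
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```
In C source each backslash of the pattern is doubled, so the
back-reference example is written
.Qq ^\e\e([bc]\e\e)\e\e1$ ;
it matches
.Sq bb
and
.Sq cc
but not
.Sq bc .
Likewise
.Sq ^a+$
matches the two-character string
.Sq a+
rather than
.Sq aa ,
because
.Sq +
is an ordinary character in a BRE.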
.Pp
The following is a list of basic regular expressions:
.Bl -tag -width Ds
.It Ar c
Any character
.Ar c
not listed below matches itself.
.It \e Ns Ar c
Any backslash-escaped character
.Ar c ,
except for
.Sq { ,
.Sq } ,
.Sq \&( ,
and
.Sq \&) ,
matches itself.
.It \&.
Matches any single character that is not a newline
.Pq Sq \en .
.It Bq Ar char-class
Matches any single character in
.Ar char-class .
To include a
.Ql \&]
in
.Ar char-class ,
it must be the first character.
A range of characters may be specified by separating the end characters
of the range with a
.Ql - ;
e.g.\&
.Ar a-z
specifies the lower case characters.
The following literal expressions can also be used in
.Ar char-class
to specify sets of characters:
.Bd -unfilled -offset indent
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
.Ed
.Pp
If
.Ql -
appears as the first or last character of
.Ar char-class ,
then it matches itself.
All other characters in
.Ar char-class
match themselves.
.Pp
Patterns in
.Ar char-class
of the form
.Eo [.
.Ar col-elm
.Ec .]\&
or
.Eo [=
.Ar col-elm
.Ec =]\& ,
where
.Ar col-elm
is a collating element, are interpreted according to
.Xr setlocale 3
.Pq not currently supported .
.It Bq ^ Ns Ar char-class
Matches any single character, other than newline, not in
.Ar char-class .
.Ar char-class
is defined as above.
.It ^
If
.Sq ^
is the first character of a regular expression, then it
anchors the regular expression to the beginning of a line.
Otherwise, it matches itself.
.It $
If
.Sq $
is the last character of a regular expression,
it anchors the regular expression to the end of a line.
Otherwise, it matches itself.
677Otherwise, it matches itself.
678.It [[:<:]]
679Anchors the single character regular expression or subexpression
680immediately following it to the beginning of a word.
681.It [[:>:]]
682Anchors the single character regular expression or subexpression
683immediately following it to the end of a word.
.It \e( Ns Ar re Ns \e)
Defines a subexpression
.Ar re .
Subexpressions may be nested.
A subsequent backreference of the form
.Pf \e Ar n ,
where
.Ar n
is a number in the range [1,9], expands to the text matched by the
.Ar n Ns th
subexpression.
For example, the regular expression
.Ar \e(.*\e)\e1
matches any string consisting of identical adjacent substrings.
Subexpressions are ordered relative to their left delimiter.
.It *
Matches the single character regular expression or subexpression
immediately preceding it zero or more times.
If
.Sq *
is the first character of a regular expression or subexpression,
then it matches itself.
The
.Sq *
operator sometimes yields unexpected results.
For example, the regular expression
.Ar b*
matches the beginning of the string
.Qq abbb
(as opposed to the substring
.Qq bbb ) ,
since a null match is the only leftmost match.
.Sm off
.It Xo
.Pf \e{ Ar n , m No \e}\ \&
.Pf \e{ Ar n , No \e}\ \&
.Pf \e{ Ar n No \e}
.Xc
.Sm on
Matches the single character regular expression or subexpression
immediately preceding it at least
.Ar n
and at most
.Ar m
times.
If
.Ar m
is omitted, then it matches at least
.Ar n
times.
If the comma is also omitted, then it matches exactly
.Ar n
times.
.El
.Sh SEE ALSO
.Xr regex 3
.Sh STANDARDS
.St -p1003.1-2004 :
Base Definitions, Chapter 9 (Regular Expressions).
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current POSIX spec says that
.Sq )\&
is an ordinary character in the absence of an unmatched
.Sq \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back-references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Sq a\e(\e(b\e)*\e2\e)*d
match
.Sq abbbd ? ) .
Avoid using them.
.Pp
POSIX's specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
.Pp
The syntax for word boundaries is incredibly ugly.
770