xref: /openbsd-src/lib/libc/regex/re_format.7 (revision 62a742911104f98b9185b2c6b6007d9b1c36396c)
1.\"	$OpenBSD: re_format.7,v 1.5 1997/07/29 02:29:09 flipk Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. All advertising materials mentioning features or use of this software
21.\"    must display the following acknowledgement:
22.\"	This product includes software developed by the University of
23.\"	California, Berkeley and its contributors.
24.\" 4. Neither the name of the University nor the names of its contributors
25.\"    may be used to endorse or promote products derived from this software
26.\"    without specific prior written permission.
27.\"
28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
31.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
38.\" SUCH DAMAGE.
39.\"
40.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
41.\"
42.Dd March 20, 1994
43.Dt RE_FORMAT 7
44.Os OpenBSD
45.Sh NAME
46.Nm re_format
47.Nd POSIX 1003.2 regular expressions
48.Sh DESCRIPTION
49Regular expressions (``RE''s),
50as defined in POSIX 1003.2, come in two forms:
51modern REs (roughly those of
52.Xr egrep 1 ;
531003.2 calls these ``extended'' REs)
54and obsolete REs (roughly those of
55.Xr ed 1 ;
561003.2 ``basic'' REs).
57Obsolete REs mostly exist for backward compatibility in some old programs;
58they will be discussed at the end.
591003.2 leaves some aspects of RE syntax and semantics open;
60`\(dg' marks decisions on these aspects that
61may not be fully portable to other 1003.2 implementations.
62.Pp
63A (modern) RE is one\(dg or more non-empty\(dg
64.Em branches ,
65separated by `|'. It matches anything that matches one of the branches.
66.Pp
67A branch is one\(dg or more
68.Em pieces ,
69concatenated.
70It matches a match for the first, followed by a match for the second, etc.
71.Pp
72A piece is an
73.Em atom
74possibly followed by a single\(dg `*', `+', `?', or
75.Em bound .
76An atom followed by `*' matches a sequence of 0 or more matches of the atom.
77An atom followed by `+' matches a sequence of 1 or more matches of the atom.
78An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
79.Pp
80A
81.Em bound
82is `{' followed by an unsigned decimal integer,
83possibly followed by `,'
84possibly followed by another unsigned decimal integer,
85always followed by `}'.
86The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
87and if there are two of them, the first may not exceed the second.
88An atom followed by a bound containing one integer \fIi\fR
89and no comma matches
90a sequence of exactly \fIi\fR matches of the atom.
91An atom followed by a bound
92containing one integer \fIi\fR and a comma matches
93a sequence of \fIi\fR or more matches of the atom.
94An atom followed by a bound
95containing two integers \fIi\fR and \fIj\fR matches
96a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
97.Pp
98An
99.Em atom
100is a regular expression enclosed in `()'
101(matching a match for the regular expression),
102an empty set of `()' (matching the null string)\(dg,
103a
104.Em "bracket expression"
105(see below), `.'
106(matching any single character), `^' (matching the null string at the
107beginning of a line), `$' (matching the null string at the
108end of a line), a `\e' followed by one of the characters
109`^.[$()|*+?{\e'
110(matching that character taken as an ordinary character),
111a `\e' followed by any other character\(dg
112(matching that character taken as an ordinary character,
113as if the `\e' had not been present\(dg),
114or a single character with no other significance (matching that character).
115A `{' followed by a character other than a digit is an ordinary
116character, not the beginning of a bound\(dg.
117It is illegal to end an RE with `\e'.
118.Pp
119A
120.Em "bracket expression"
121is a list of characters enclosed in `[]'.
122It normally matches any single character from the list (but see below).
123If the list begins with `^',
124it matches any single character
125(but see below)
126.Em not
127from the rest of the list.
128If two characters in the list are separated by `\-', this is shorthand
129for the full
130.Em range
131of characters between those two (inclusive) in the
132collating sequence,
133e.g. `[0-9]' in ASCII matches any decimal digit.
134It is illegal\(dg for two ranges to share an
135endpoint, e.g. `a-c-e'.
136Ranges are very collating-sequence-dependent,
137and portable programs should avoid relying on them.
138.Pp
139To include a literal `]' in the list, make it the first character
140(following a possible `^').
141To include a literal `\-', make it the first or last character,
142or the second endpoint of a range.
143To use a literal `\-' as the first endpoint of a range,
144enclose it in `[.' and `.]' to make it a collating element (see below).
145With the exception of these and some combinations using `[' (see next
146paragraphs), all other special characters, including `\e', lose their
147special significance within a bracket expression.
148.Pp
149Within a bracket expression, a collating element (a character,
150a multi-character sequence that collates as if it were a single character,
151or a collating-sequence name for either)
152enclosed in `[.' and `.]' stands for the
153sequence of characters of that collating element.
154The sequence is a single element of the bracket expression's list.
155A bracket expression containing a multi-character collating element
156can thus match more than one character,
157e.g. if the collating sequence includes a `ch' collating element,
158then the RE `[[.ch.]]*c' matches the first five characters
159of `chchcc'.
160.Pp
161Within a bracket expression, a collating element enclosed in `[=' and
162`=]' is an equivalence class, standing for the sequences of characters
163of all collating elements equivalent to that one, including itself.
164(If there are no other equivalent collating elements,
165the treatment is as if the enclosing delimiters were `[.' and `.]'.)
166For example, if o and \o'o^' are the members of an equivalence class,
167then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
168An equivalence class may not\(dg be an endpoint
169of a range.
170.Pp
171Within a bracket expression, the name of a
172.Em "character class"
173enclosed
174in `[:' and `:]' stands for the list of all characters belonging to that
175class.
176Standard character class names are:
177.Pp
178.Bl -item -compact -offset indent
179.It
180alnum	digit	punct
181.It
182alpha	graph	space
183.It
184blank	lower	upper
185.It
186cntrl	print	xdigit
187.El
188.Pp
189These stand for the character classes defined in
190.Xr ctype 3 .
191A locale may provide others.
192A character class may not be used as an endpoint of a range.
193.Pp
194There are two special cases\(dg of bracket expressions:
195the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
196the beginning and end of a word respectively.
197A word is defined as a sequence of
198word characters
199which is neither preceded nor followed by
200word characters.
201A word character is an
202.Em alnum
203character (as defined by
204.Xr ctype 3 )
205or an underscore.
206This is an extension,
207compatible with but not specified by POSIX 1003.2,
208and should be used with
209caution in software intended to be portable to other systems.
210.Pp
211In the event that an RE could match more than one substring of a given
212string,
213the RE matches the one starting earliest in the string.
214If the RE could match more than one substring starting at that point,
215it matches the longest.
216Subexpressions also match the longest possible substrings, subject to
217the constraint that the whole match be as long as possible,
218with subexpressions starting earlier in the RE taking priority over
219ones starting later.
220Note that higher-level subexpressions thus take priority over
221their lower-level component subexpressions.
222.Pp
223Match lengths are measured in characters, not collating elements.
224A null string is considered longer than no match at all.
225For example,
226`bb*' matches the three middle characters of `abbbc',
227`(wee|week)(knights|nights)' matches all ten characters of `weeknights',
228when `(.*).*' is matched against `abc' the parenthesized subexpression
229matches all three characters, and
230when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
231subexpression match the null string.
232.Pp
233If case-independent matching is specified,
234the effect is much as if all case distinctions had vanished from the
235alphabet.
236When an alphabetic that exists in multiple cases appears as an
237ordinary character outside a bracket expression, it is effectively
238transformed into a bracket expression containing both cases,
239e.g. `x' becomes `[xX]'.
240When it appears inside a bracket expression, all case counterparts
241of it are added to the bracket expression, so that (e.g.) `[x]'
242becomes `[xX]' and `[^x]' becomes `[^xX]'.
243.Pp
244No particular limit is imposed on the length of REs\(dg.
245Programs intended to be portable should not employ REs longer
246than 256 bytes,
247as an implementation can refuse to accept such REs and remain
248POSIX-compliant.
249.Pp
250Obsolete (``basic'') regular expressions differ in several respects.
251`|', `+', and `?' are ordinary characters and there is no equivalent
252for their functionality.
253The delimiters for bounds are `\e{' and `\e}',
254with `{' and `}' by themselves ordinary characters.
255The parentheses for nested subexpressions are `\e(' and `\e)',
256with `(' and `)' by themselves ordinary characters.
257`^' is an ordinary character except at the beginning of the
258RE or\(dg the beginning of a parenthesized subexpression,
259`$' is an ordinary character except at the end of the
260RE or\(dg the end of a parenthesized subexpression,
261and `*' is an ordinary character if it appears at the beginning of the
262RE or the beginning of a parenthesized subexpression
263(after a possible leading `^').
264Finally, there is one new type of atom, a
265.Em "back reference" :
266`\e' followed by a non-zero decimal digit
267.Em d
268matches the same sequence of characters
269matched by the
270.Em d Ns th
271parenthesized subexpression
272(numbering subexpressions by the positions of their opening parentheses,
273left to right),
274so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
275.Sh SEE ALSO
276.Xr regex 3
277.Pp
278POSIX 1003.2, section 2.8 (Regular Expression Notation).
279.Sh BUGS
280Having two kinds of REs is a botch.
281.Pp
282The current 1003.2 spec says that `)' is an ordinary character in
283the absence of an unmatched `(';
284this was an unintentional result of a wording error,
285and change is likely.
286Avoid relying on it.
287.Pp
288Back references are a dreadful botch,
289posing major problems for efficient implementations.
290They are also somewhat vaguely defined
291(does
292`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
293Avoid using them.
294.Pp
2951003.2's specification of case-independent matching is vague.
296The ``one case implies all cases'' definition given above
297is current consensus among implementors as to the right interpretation.
298.Pp
299The syntax for word boundaries is incredibly ugly.
300