.\" $NetBSD: re_format.7,v 1.14 2021/02/24 09:10:12 wiz Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/re_format.7 314373 2017-02-28 05:14:42Z glebius $
.\"
.Dd February 22, 2021
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX 1003.2 regular expressions
.Sh DESCRIPTION
Regular expressions
.Pq Dq RE Ns s ,
as defined in
.St -p1003.2 ,
come in two forms:
modern REs (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
REs)
and obsolete REs (roughly those of
.Xr ed 1 ;
1003.2
.Dq basic
REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
leaves some aspects of RE syntax and semantics open;
`\(dd' marks decisions on these aspects that
may not be fully portable to other
.St -p1003.2
implementations.
.Pp
A (modern) RE is one\(dd or more non-empty\(dd
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
A branch is one\(dd or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed
by a single\(dd
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
or
.Em bound .
An atom followed by
.Ql \&*
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Ql \&+
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Ql ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A
.Em bound
is
.Ql \&{
followed by an unsigned decimal integer,
possibly followed by
.Ql \&,
possibly followed by another unsigned decimal integer,
always followed by
.Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255\(dd) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Em i
and no comma matches
a sequence of exactly
.Em i
matches of the atom.
An atom followed by a bound
containing one integer
.Em i
and a comma matches
a sequence of
.Em i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Em i
and
.Em j
matches
a sequence of
.Em i
through
.Em j
(inclusive) matches of the atom.
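.Pp
The bound rules above can be tried out with the
.Xr regex 3
interface.
The following is only a sketch:
.Fn ere_matches
is a hypothetical helper introduced for this illustration,
not a library function,
and the patterns passed to it are assumed to be anchored with
.Ql \&^
and
.Ql \&$
by the caller.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of regex(3)): compile pattern as a
 * modern (extended) RE and test it against text.
 * Returns 1 on a match, 0 on no match, -1 if the RE does not compile.
 */
int
ere_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	/* No submatch information is requested, so nmatch is 0. */
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```

With this helper,
.Ql ^ab{2}c$
should accept only
.Ql abbc ,
.Ql ^ab{2,}c$
should accept two or more
.Ql b
characters,
and a bound whose first integer exceeds the second, such as
.Ql {3,1} ,
is expected to be rejected when the RE is compiled.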
.Pp
An atom is a regular expression enclosed in
.Ql ()
(matching a match for the
regular expression),
an empty set of
.Ql ()
(matching the null string)\(dd,
a
.Em bracket expression
(see below),
.Ql .\&
(matching any single character),
.Ql \&^
(matching the null string at the beginning of a line),
.Ql \&$
(matching the null string at the end of a line), a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Ql \e
followed by any other character\(dd
(matching that character taken as an ordinary character,
as if the
.Ql \e
had not been present\(dd),
or a single character with no other significance (matching that character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dd.
It is illegal to end an RE with
.Ql \e .
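.Pp
As a rough illustration of escaped atoms, the sketch below checks that a
special character preceded by a backslash is matched literally,
and that an RE ending in a backslash fails to compile.
The helper name
.Fn atom_matches
is an assumption made for this example;
only
.Fn regcomp ,
.Fn regexec ,
and
.Fn regfree
come from
.Xr regex 3 .

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: test text against a modern (extended) RE.
 * Returns 1 on a match, 0 on no match, -1 if regcomp() rejects the RE
 * (for example because it ends with a lone backslash).
 */
int
atom_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```

Note that in C source each backslash of the RE must itself be doubled
inside a string literal.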
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Ql \&^ ,
it matches any single character
(but see below)
.Em not
from the rest of the list.
If two characters in the list are separated by
.Ql \&- ,
this is shorthand
for the full
.Em range
of characters between those two (inclusive) in the
collating sequence,
.No e.g. Ql [0-9]
in ASCII matches any decimal digit.
It is illegal\(dd for two ranges to share an
endpoint,
.No e.g. Ql a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Ql \&]
in the list, make it the first character
(following a possible
.Ql \&^ ) .
To include a literal
.Ql \&- ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Ql \&-
as the first endpoint of a range,
enclose it in
.Ql [.\&
and
.Ql .]\&
to make it a collating element (see below).
With the exception of these and some combinations using
.Ql \&[
(see next paragraphs), all other special characters, including
.Ql \e ,
lose their special significance within a bracket expression.
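.Pp
The placement rules for literal
.Ql \&]
and
.Ql \&-
can be sketched as follows.
This is an illustration only;
.Fn bracket_matches
is a hypothetical name,
and the example assumes the default
.Dq C
locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: test text against a modern (extended) RE.
 * Returns 1 on a match, 0 on no match, -1 if the RE does not compile.
 */
int
bracket_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}

/*
 * Example patterns:
 *   "^[]a-]+$"  ']' placed first and '-' placed last are both literal
 *   "^[^]a-]$"  the same list, negated
 *   "^[\\.]+$"  inside brackets the backslash is an ordinary character
 */
```

In the last pattern the list contains two members,
a backslash and a period,
because the backslash does not escape anything inside a bracket expression.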
.Pp
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Ql [.\&
and
.Ql .]\&
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
then the RE
.Ql [[.ch.]]*c
matches the first five characters
of
.Ql chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Ql [=
and
.Ql =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Ql [.\&
and
.Ql .] . )
For example, if
.Ql x
and
.Ql y
are the members of an equivalence class,
then
.Ql [[=x=]] ,
.Ql [[=y=]] ,
and
.Ql [xy]
are all synonymous.
An equivalence class may not\(dd be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed in
.Ql [:
and
.Ql :]
stands for the list of all characters belonging to that
class.
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em "alnum	digit	punct"
.It Em "alpha	graph	space"
.It Em "blank	lower	upper"
.It Em "cntrl	print	xdigit"
.El
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracket expression like
.Ql [[:class:]]
can be used to match a single character that belongs to a character
class.
For the reverse, matching any character that does not belong to a specific
class, the negation operator of bracket expressions may be used:
.Ql [^[:class:]] .
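.Pp
A minimal sketch of character classes in use, assuming the default
.Dq C
locale and a hypothetical helper named
.Fn class_matches :

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: test text against a modern (extended) RE.
 * Returns 1 on a match, 0 on no match, -1 if the RE does not compile.
 */
int
class_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}

/*
 * Example patterns:
 *   "^[[:alpha:]_][[:alnum:]_]*$"  a C-style identifier
 *   "[^[:digit:]]"                 any character that is not a digit
 */
```

A character class name is only meaningful inside a bracket expression,
which is why the class appears within a second pair of brackets.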
.Pp
There are two special cases\(dd of bracket expressions:
the bracket expressions
.Ql [[:<:]]
and
.Ql [[:>:]]
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
The additional word delimiters
.Ql \e<
and
.Ql \e>
are provided to ease compatibility with traditional
SVR4
systems but are not portable and should be avoided.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Ql bb*
matches the three middle characters of
.Ql abbbc ,
.Ql (wee|week)(knights|nights)
matches all ten characters of
.Ql weeknights ,
when
.Ql (.*).*\&
is matched against
.Ql abc
the parenthesized subexpression
matches all three characters, and
when
.Ql (a*)*
is matched against
.Ql bc
both the whole RE and the parenthesized
subexpression match the null string.
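.Pp
The leftmost-longest rule can be observed through the offsets that
.Fn regexec
reports for the whole match.
The sketch below is illustrative only:
.Vt struct re_span
and
.Fn re_find
are hypothetical names,
and the offsets are byte offsets,
which coincide with character positions for the ASCII examples above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical result type: offsets of the whole match, or -1/-1. */
struct re_span {
	int start;	/* offset of the first matched byte */
	int end;	/* offset just past the last matched byte */
};

/*
 * Hypothetical helper: locate the match of a modern (extended) RE.
 * Both fields are -1 if there is no match or the RE does not compile.
 */
struct re_span
re_find(const char *pattern, const char *text)
{
	struct re_span span = { -1, -1 };
	regex_t re;
	regmatch_t m[1];

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return span;
	/* m[0] receives the offsets of the entire match. */
	if (regexec(&re, text, 1, m, 0) == 0) {
		span.start = (int)m[0].rm_so;
		span.end = (int)m[0].rm_eo;
	}
	regfree(&re);
	return span;
}
```

For
.Ql bb*
against
.Ql abbbc
the expected span is offsets 1 through 4,
and for
.Ql (wee|week)(knights|nights)
against
.Ql weeknights
it is 0 through 10,
in line with the examples in the text.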
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
.No e.g. Ql x
becomes
.Ql [xX] .
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
.Ql [x]
becomes
.Ql [xX]
and
.Ql [^x]
becomes
.Ql [^xX] .
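.Pp
With
.Xr regex 3 ,
case-independent matching is requested with the
.Dv REG_ICASE
compilation flag.
The following sketch, built around a hypothetical helper
.Fn re_matches_flags ,
checks the transformations described above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern as a modern (extended) RE with
 * the extra compilation flags in cflags and test it against text.
 * Returns 1 on a match, 0 on no match, -1 if the RE does not compile.
 */
int
re_matches_flags(const char *pattern, int cflags, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED | cflags) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```

The negated case is the one most easily overlooked:
with
.Dv REG_ICASE ,
.Ql [^x]
is expected to reject both
.Ql x
and
.Ql X .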
.Pp
No particular limit is imposed on the length of REs\(dd.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Pp
Obsolete
.Pq Dq basic
regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
.Ql \&+
and
.Ql ?\&
are ordinary characters, and their functionality
can be expressed using bounds
.Po
.Ql {1,}
or
.Ql {0,1}
respectively
.Pc .
Also note that
.Ql x+
in modern REs is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
and
.Ql \e} ,
with
.Ql \&{
and
.Ql \&}
by themselves ordinary characters.
The parentheses for nested subexpressions are
.Ql \e(
and
.Ql \e) ,
with
.Ql \&(
and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
is an ordinary character except at the beginning of the
RE or\(dd the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
RE or\(dd the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
.Em back reference :
.Ql \e
followed by a non-zero decimal digit
.Em d
matches the same sequence of characters
matched by the
.Em d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.)
.Ql \e([bc]\e)\e1
matches
.Ql bb
or
.Ql cc
but not
.Ql bc .
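.Pp
Obsolete REs are what
.Fn regcomp
compiles when
.Dv REG_EXTENDED
is not given.
The sketch below uses a hypothetical helper,
.Fn bre_matches ,
to exercise the differences listed above:
ordinary
.Ql \&+
and
.Ql \&| ,
backslashed bound delimiters,
and a back reference.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern as an obsolete (basic) RE,
 * i.e. without REG_EXTENDED, and test it against text.
 * Returns 1 on a match, 0 on no match, -1 if the RE does not compile.
 */
int
bre_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, 0) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}

/*
 * Example patterns (each backslash is doubled in C string literals):
 *   "^\\([bc]\\)\\1$"  back reference: "bb" or "cc", but not "bc"
 *   "^a+$"             '+' is ordinary: only the two characters "a+"
 *   "^a\\{1,\\}$"      bound with backslashed braces: one or more 'a'
 *   "^a|b$"            '|' is ordinary: only the string "a|b"
 */
```

The back reference compares the text actually matched by the first
subexpression,
not the subexpression itself,
which is why
.Ql bc
is rejected.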
.Sh SEE ALSO
.Xr regex 3
.Rs
.%T Regular Expression Notation
.%R IEEE Std
.%N 1003.2
.%P section 2.8
.Re
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current
.St -p1003.2
spec says that
.Ql \&)
is an ordinary character in
the absence of an unmatched
.Ql \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Ql a\e(\e(b\e)*\e2\e)*d
match
.Ql abbbd ? ) .
Avoid using them.
.Pp
The
.St -p1003.2
specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
.Pp
The syntax for word boundaries is incredibly ugly.