.\" $NetBSD: re_format.7,v 1.16 2022/12/04 16:52:48 uwe Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/re_format.7 314373 2017-02-28 05:14:42Z glebius $
.\"
.Dd February 22, 2021
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX 1003.2 regular expressions
.Sh DESCRIPTION
Regular expressions
.Pq Dq RE Ns s ,
as defined in
.St -p1003.2 ,
come in two forms:
modern REs (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
REs)
and obsolete REs (roughly those of
.Xr ed 1 ;
1003.2
.Dq basic
REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
leaves some aspects of RE syntax and semantics open;
.ds DG \\s-2\\v'-0.4m'\\(dg\\v'0.4m'\\s+2
`\(dg' marks decisions on these aspects that
may not be fully portable to other
.St -p1003.2
implementations.
.Ss Extended regular expressions
A (modern) RE is one\*(DG or more non-empty\*(DG
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
A branch is one\*(DG or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed
by a single\*(DG
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
or
.Em bound .
An atom followed by
.Ql \&*
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Ql \&+
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Ql ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A
.Em bound
is
.Ql \&{
followed by an unsigned decimal integer,
possibly followed by
.Ql \&,
possibly followed by another unsigned decimal integer,
always followed by
.Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255\*(DG) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Em i
and no comma matches
a sequence of exactly
.Em i
matches of the atom.
An atom followed by a bound
containing one integer
.Em i
and a comma matches
a sequence of
.Em i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Em i
and
.Em j
matches
a sequence of
.Em i
through
.Em j
(inclusive) matches of the atom.
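.Pp
As a minimal sketch (not part of the original page), the repetition
operators and bounds can be tried out with
.Xr regcomp 3
and
.Xr regexec 3 .
The helper
.Fn ere_match
is a hypothetical name introduced here for illustration only.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the extended RE matches somewhere
 * in text, 0 if it does not, and -1 if the pattern fails to compile.
 */
static int
ere_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	/* REG_EXTENDED selects modern REs; REG_NOSUB skips offset reporting. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```
.Ed
With this helper,
.Ql ^a{2,3}$
is expected to match
.Ql aa
and
.Ql aaa
but not
.Ql a
or
.Ql aaaa ,
and a bound whose first integer exceeds the second, such as
.Ql {3,2} ,
is expected to be rejected when the RE is compiled.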
.Pp
An atom is a regular expression enclosed in
.Ql ()
(matching a match for the
regular expression),
an empty set of
.Ql ()
(matching the null string)\*(DG,
a
.Em bracket expression
(see below),
.Ql .\&
(matching any single character),
.Ql \&^
(matching the null string at the beginning of a line),
.Ql \&$
(matching the null string at the end of a line), a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Ql \e
followed by any other character\*(DG
(matching that character taken as an ordinary character,
as if the
.Ql \e
had not been present\*(DG),
or a single character with no other significance (matching that character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\*(DG.
It is illegal to end an RE with
.Ql \e .
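.Pp
The list of characters that take a
.Ql \e
prefix suggests a simple way to search for a literal string with an
extended RE.
The following is a sketch under that assumption, not part of the original
page; the names
.Fn ere_quote
and
.Fn ere_contains_literal
and the 256-byte pattern buffer are hypothetical choices made for
illustration.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst, placing a backslash before
 * each character from the list ^.[$()|*+?{\ so that it is taken as an
 * ordinary character.  Returns 0 on success, -1 if dst is too small.
 */
static int
ere_quote(const char *src, char *dst, size_t dstlen)
{
	static const char special[] = "^.[$()|*+?{\\";
	size_t n = 0;

	if (dstlen == 0)
		return -1;
	for (; *src != '\0'; src++) {
		if (strchr(special, *src) != NULL) {
			if (n + 1 >= dstlen)
				return -1;
			dst[n++] = '\\';
		}
		if (n + 1 >= dstlen)
			return -1;
		dst[n++] = *src;
	}
	dst[n] = '\0';
	return 0;
}

/* Returns 1 if text contains literal, 0 if not, -1 on any error. */
static int
ere_contains_literal(const char *literal, const char *text)
{
	char pat[256];
	regex_t re;
	int rc;

	if (ere_quote(literal, pat, sizeof(pat)) != 0)
		return -1;
	if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```
.Ed
Quoting matters because an unquoted
.Ql 1+1
would match
.Ql 11 ,
whereas the quoted form only matches the three characters
.Ql 1+1 .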
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Ql \&^ ,
it matches any single character
(but see below)
.Em not
from the rest of the list.
If two characters in the list are separated by
.Ql \&- ,
this is shorthand
for the full
.Em range
of characters between those two (inclusive) in the
collating sequence,
.No e.g. Ql [0-9]
in ASCII matches any decimal digit.
It is illegal\*(DG for two ranges to share an
endpoint,
.No e.g. Ql a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Ql \&]
in the list, make it the first character
(following a possible
.Ql \&^ ) .
To include a literal
.Ql \&- ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Ql \&-
as the first endpoint of a range,
enclose it in
.Ql [.\&
and
.Ql .]\&
to make it a collating element (see below).
With the exception of these and some combinations using
.Ql \&[
(see next paragraphs), all other special characters, including
.Ql \e ,
lose their special significance within a bracket expression.
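.Pp
The placement rules for
.Ql \&]
and
.Ql \&-
can be checked one character at a time.
The following sketch is not part of the original page; it assumes the
default
.Dq C
locale, and
.Fn in_bracket
is a hypothetical helper that wraps a bracket expression in
.Ql \&^
and
.Ql \&$
and tests a single character against it.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the single character c is matched
 * by the given bracket expression, 0 if not, -1 on any error.
 */
static int
in_bracket(const char *bracket, char c)
{
	char pat[128];
	char text[2];
	regex_t re;
	int n, rc;

	text[0] = c;
	text[1] = '\0';
	/* Anchor the bracket expression so it must match all of text. */
	n = snprintf(pat, sizeof(pat), "^%s$", bracket);
	if (n < 0 || (size_t)n >= sizeof(pat))
		return -1;
	if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```
.Ed
For instance,
.Ql []a]
is expected to accept
.Ql \&] ,
.Ql [a-]
to accept
.Ql \&- ,
and
.Ql [\e.]
to accept a literal backslash, since
.Ql \e
is not special inside a bracket expression.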
.Pp
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Ql [.\&
and
.Ql .]\&
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
then the RE
.Ql [[.ch.]]*c
matches the first five characters
of
.Ql chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Ql [=
and
.Ql =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Ql [.\&
and
.Ql .] . )
For example, if
.Ql x
and
.Ql y
are the members of an equivalence class,
then
.Ql [[=x=]] ,
.Ql [[=y=]] ,
and
.Ql [xy]
are all synonymous.
An equivalence class may not\*(DG be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed in
.Ql [:
and
.Ql :]
stands for the list of all characters belonging to that
class.
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em "alnum	digit	punct"
.It Em "alpha	graph	space"
.It Em "blank	lower	upper"
.It Em "cntrl	print	xdigit"
.El
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracket expression like
.Ql [[:class:]]
can be used to match a single character that belongs to a character
class.
For the reverse, matching any character that does not belong to a specific
class, the negation operator of bracket expressions may be used:
.Ql [^[:class:]] .
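.Pp
Character classes combine naturally with other list members.
The sketch below, which is not part of the original page, uses
.Ql [[:alpha:]_]
and
.Ql [^[:digit:]]
in two hypothetical helpers,
.Fn is_identifier
and
.Fn has_non_digit ;
both names are assumptions made for illustration.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 on a match, 0 on no match, -1 if the RE does not compile. */
static int
ere_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}

/* A letter or underscore followed by letters, digits, or underscores. */
static int
is_identifier(const char *s)
{
	return ere_match("^[[:alpha:]_][[:alnum:]_]*$", s);
}

/* Negated class: true if any character is not a decimal digit. */
static int
has_non_digit(const char *s)
{
	return ere_match("[^[:digit:]]", s);
}
```
.Ed
Which characters count as
.Em alpha
or
.Em digit
depends on the current locale, so results for non-ASCII input may differ
between systems.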
.Pp
There are two special cases\*(DG of bracket expressions:
the bracket expressions
.Ql [[:<:]]
and
.Ql [[:>:]]
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
The additional word delimiters
.Ql \e<
and
.Ql \e>
are provided to ease compatibility with traditional
SVR4
systems but are not portable and should be avoided.
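.Pp
Where neither word-boundary syntax can be relied upon, the definition of
a word given above can be approximated with ordinary bracket expressions.
The following is only a sketch of that idea and not part of the original
page;
.Fn contains_word
is a hypothetical helper, and unlike
.Ql [[:<:]]
and
.Ql [[:>:]]
the RE it builds consumes the neighbouring non-word characters, so it is
suited to yes/no tests rather than to extracting match offsets.
.Bd -literal -offset indent
```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if word occurs in text as a whole
 * word, 0 if not, -1 on error.  word must consist solely of word
 * characters (alnum or underscore), so no quoting is needed.
 */
static int
contains_word(const char *text, const char *word)
{
	char pat[256];
	const char *p;
	regex_t re;
	int n, rc;

	if (*word == '\0')
		return -1;
	for (p = word; *p != '\0'; p++)
		if (!isalnum((unsigned char)*p) && *p != '_')
			return -1;
	/* Word start: line start or a non-word character; likewise the end. */
	n = snprintf(pat, sizeof(pat),
	    "(^|[^[:alnum:]_])%s($|[^[:alnum:]_])", word);
	if (n < 0 || (size_t)n >= sizeof(pat))
		return -1;
	if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```
.Ed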
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Ql bb*
matches the three middle characters of
.Ql abbbc ,
.Ql (wee|week)(knights|nights)
matches all ten characters of
.Ql weeknights ,
when
.Ql (.*).*\&
is matched against
.Ql abc
the parenthesized subexpression
matches all three characters, and
when
.Ql (a*)*
is matched against
.Ql bc
both the whole RE and the parenthesized
subexpression match the null string.
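.Pp
The offsets of the overall match can be inspected through the
.Vt regmatch_t
array filled in by
.Xr regexec 3 .
The following sketch is not part of the original page;
.Fn ere_span
is a hypothetical helper that reports only the whole-match offsets,
not those of subexpressions.
.Bd -literal -offset indent
```c
#include <regex.h>

/*
 * Hypothetical helper: on a match, stores the start and end offsets of
 * the whole match in *so and *eo and returns 1.  Returns 0 if there is
 * no match and -1 if the RE does not compile.
 */
static int
ere_span(const char *pattern, const char *text, int *so, int *eo)
{
	regex_t re;
	regmatch_t pm[1];
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	/* pm[0] receives the offsets of the entire match. */
	rc = regexec(&re, text, 1, pm, 0);
	regfree(&re);
	if (rc != 0)
		return 0;
	*so = (int)pm[0].rm_so;
	*eo = (int)pm[0].rm_eo;
	return 1;
}
```
.Ed
Applied to the examples above,
.Ql bb*
against
.Ql abbbc
should yield offsets 1 and 4,
and
.Ql (a*)*
against
.Ql bc
should yield an empty match with both offsets equal to 0.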
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
.No e.g. Ql x
becomes
.Ql [xX] .
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
.Ql [x]
becomes
.Ql [xX]
and
.Ql [^x]
becomes
.Ql [^xX] .
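.Pp
With
.Xr regcomp 3 ,
case-independent matching is requested through the
.Dv REG_ICASE
flag.
The sketch below is not part of the original page;
.Fn ere_match_flags
is a hypothetical helper that passes extra compile flags through.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: like a plain ERE match, but with extra regcomp
 * flags such as REG_ICASE.  Returns 1, 0, or -1 on a compile error.
 */
static int
ere_match_flags(const char *pattern, const char *text, int extra)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```
.Ed
Under
.Dv REG_ICASE ,
.Ql ^[^x]$
is expected to reject
.Ql X ,
because the negated list is treated as
.Ql [^xX] .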
.Pp
No particular limit is imposed on the length of REs\*(DG.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Ss Basic regular expressions
Obsolete
.Pq Dq basic
regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
.Ql \&+
and
.Ql ?\&
are ordinary characters, and their functionality
can be expressed using bounds
.Po
.Ql {1,}
or
.Ql {0,1}
respectively
.Pc .
Also note that
.Ql x+
in modern REs is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
and
.Ql \e} ,
with
.Ql \&{
and
.Ql \&}
by themselves ordinary characters.
The parentheses for nested subexpressions are
.Ql \e(
and
.Ql \e) ,
with
.Ql \&(
and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
is an ordinary character except at the beginning of the
RE or\*(DG the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
RE or\*(DG the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
.Em back reference :
.Ql \e
followed by a non-zero decimal digit
.Em d
matches the same sequence of characters
matched by the
.Em d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.)
.Ql \e([bc]\e)\e1
matches
.Ql bb
or
.Ql cc
but not
.Ql bc .
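.Pp
Basic REs are what
.Xr regcomp 3
compiles when
.Dv REG_EXTENDED
is not given.
The following sketch is not part of the original page;
.Fn bre_match
is a hypothetical helper used to try out the differences listed above.
.Bd -literal -offset indent
```c
#include <regex.h>

/*
 * Hypothetical helper: returns 1 if the basic RE matches somewhere in
 * text, 0 if it does not, and -1 if the pattern fails to compile.
 */
static int
bre_match(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t pm[1];
	int rc;

	/* No REG_EXTENDED: the pattern is compiled as a basic RE. */
	if (regcomp(&re, pattern, 0) != 0)
		return -1;
	rc = regexec(&re, text, 1, pm, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```
.Ed
In C source each backslash of the RE must itself be doubled, so the back
reference example above is written
.Ql \(dq^\e\e([bc]\e\e)\e\e1$\(dq
when anchored at both ends.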
.Sh SEE ALSO
.Xr regex 3
.Rs
.%T Regular Expression Notation
.%R IEEE Std
.%N 1003.2
.%P section 2.8
.Re
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current
.St -p1003.2
spec says that
.Ql \&)
is an ordinary character in
the absence of an unmatched
.Ql \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Ql a\e(\e(b\e)*\e2\e)*d
match
.Ql abbbd ? ) .
Avoid using them.
.Pp
The
.St -p1003.2
specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
.Pp
The syntax for word boundaries is incredibly ugly.
495The syntax for word boundaries is incredibly ugly.
496