.\" $NetBSD: re_format.7,v 1.14 2021/02/24 09:10:12 wiz Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"    The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"      This product includes software developed by the University of
.\"      California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"    @(#)re_format.7 8.3 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/re_format.7 314373 2017-02-28 05:14:42Z glebius $
.\"
.Dd February 22, 2021
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX 1003.2 regular expressions
.Sh DESCRIPTION
Regular expressions
.Pq Dq RE Ns s ,
as defined in
.St -p1003.2 ,
come in two forms:
modern REs (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
REs)
and obsolete REs (roughly those of
.Xr ed 1 ;
1003.2
.Dq basic
REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
leaves some aspects of RE syntax and semantics open;
`\(dd' marks decisions on these aspects that
may not be fully portable to other
.St -p1003.2
implementations.
.Pp
A (modern) RE is one\(dd or more non-empty\(dd
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
A branch is one\(dd or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed
by a single\(dd
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
or
.Em bound .
An atom followed by
.Ql \&*
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Ql \&+
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Ql ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A
.Em bound
is
.Ql \&{
followed by an unsigned decimal integer,
possibly followed by
.Ql \&,
possibly followed by another unsigned decimal integer,
always followed by
.Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255\(dd) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Em i
and no comma matches
a sequence of exactly
.Em i
matches of the atom.
An atom followed by a bound
containing one integer
.Em i
and a comma matches
a sequence of
.Em i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Em i
and
.Em j
matches
a sequence of
.Em i
through
.Em j
(inclusive) matches of the atom.
.Pp
An atom is a regular expression enclosed in
.Ql ()
(matching a match for the
regular expression),
an empty set of
.Ql ()
(matching the null string)\(dd,
a
.Em bracket expression
(see below),
.Ql .\&
(matching any single character),
.Ql \&^
(matching the null string at the beginning of a line),
.Ql \&$
(matching the null string at the end of a line), a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Ql \e
followed by any other character\(dd
(matching that character taken as an ordinary character,
as if the
.Ql \e
had not been present\(dd),
or a single character with no other significance (matching that character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dd.
It is illegal to end an RE with
.Ql \e .
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Ql \&^ ,
it matches any single character
(but see below)
.Em not
from the rest of the list.
If two characters in the list are separated by
.Ql \&- ,
this is shorthand
for the full
.Em range
of characters between those two (inclusive) in the
collating sequence,
.No e.g. Ql [0-9]
in ASCII matches any decimal digit.
It is illegal\(dd for two ranges to share an
endpoint,
.No e.g. Ql a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Ql \&]
in the list, make it the first character
(following a possible
.Ql \&^ ) .
To include a literal
.Ql \&- ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Ql \&-
as the first endpoint of a range,
enclose it in
.Ql [.\&
and
.Ql .]\&
to make it a collating element (see below).
With the exception of these and some combinations using
.Ql \&[
(see next paragraphs), all other special characters, including
.Ql \e ,
lose their special significance within a bracket expression.
.Pp
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Ql [.\&
and
.Ql .]\&
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
then the RE
.Ql [[.ch.]]*c
matches the first five characters
of
.Ql chchcc .
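.Pp
As an illustrative sketch of the rules above for a literal
.Ql \&] ,
.Ql \&- ,
or
.Ql \e
inside a bracket expression, a few bracket expressions and what they
would be expected to match are listed here.
The list assumes ASCII and the C locale; it is an illustration only
and is not taken from
.St -p1003.2 .
.Bl -tag -width "[[.-.]-0]XX" -offset indent -compact
.It Ql []a]
matches
.Ql \&]
or
.Ql a ,
because the
.Ql \&]
comes first in the list.
.It Ql [^]a]
matches any single character other than
.Ql \&]
and
.Ql a .
.It Ql [a-]
matches
.Ql a
or
.Ql \&- ,
because the
.Ql \&-
comes last in the list.
.It Ql [\e.]
matches a backslash or a period, because
.Ql \e
is not special inside a bracket expression.
.It Ql [[.-.]-0]
matches
.Ql \&- ,
.Ql .\& ,
.Ql / ,
or
.Ql 0 ,
the ASCII range whose first endpoint is the collating element
.Ql \&- .
.El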
.Pp
Within a bracket expression, a collating element enclosed in
.Ql [=
and
.Ql =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Ql [.\&
and
.Ql .] . )
For example, if
.Ql x
and
.Ql y
are the members of an equivalence class,
then
.Ql [[=x=]] ,
.Ql [[=y=]] ,
and
.Ql [xy]
are all synonymous.
An equivalence class may not\(dd be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed in
.Ql [:
and
.Ql :]
stands for the list of all characters belonging to that
class.
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em alnum Ta Em digit Ta Em punct
.It Em alpha Ta Em graph Ta Em space
.It Em blank Ta Em lower Ta Em upper
.It Em cntrl Ta Em print Ta Em xdigit
.El
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracketed expression like
.Ql [[:class:]]
can be used to match a single character that belongs to a character
class.
To do the reverse, that is, to match any character that does not belong
to a specific class, use the negation operator of bracket expressions:
.Ql [^[:class:]] .
.Pp
There are two special cases\(dd of bracket expressions:
the bracket expressions
.Ql [[:<:]]
and
.Ql [[:>:]]
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
The additional word delimiters
.Ql \e<
and
.Ql \e>
are provided to ease compatibility with traditional
SVR4
systems but are not portable and should be avoided.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Ql bb*
matches the three middle characters of
.Ql abbbc ,
.Ql (wee|week)(knights|nights)
matches all ten characters of
.Ql weeknights ,
when
.Ql (.*).*\&
is matched against
.Ql abc
the parenthesized subexpression
matches all three characters, and
when
.Ql (a*)*
is matched against
.Ql bc
both the whole RE and the parenthesized
subexpression match the null string.
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
.No e.g. Ql x
becomes
.Ql [xX] .
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
.Ql [x]
becomes
.Ql [xX]
and
.Ql [^x]
becomes
.Ql [^xX] .
.Pp
No particular limit is imposed on the length of REs\(dd.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Pp
Obsolete
.Pq Dq basic
regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
.Ql \&+
and
.Ql ?\&
are ordinary characters, and their functionality
can be expressed using bounds
.Po
.Ql {1,}
or
.Ql {0,1}
respectively
.Pc .
Also note that
.Ql x+
in modern REs is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
and
.Ql \e} ,
with
.Ql \&{
and
.Ql \&}
by themselves ordinary characters.
The parentheses for nested subexpressions are
.Ql \e(
and
.Ql \e) ,
with
.Ql \&(
and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
is an ordinary character except at the beginning of the
RE or\(dd the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
RE or\(dd the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
.Em back reference :
.Ql \e
followed by a non-zero decimal digit
.Em d
matches the same sequence of characters
matched by the
.Em d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.)
.Ql \e([bc]\e)\e1
matches
.Ql bb
or
.Ql cc
but not
.Ql bc .
.Sh SEE ALSO
.Xr regex 3
.Rs
.%T Regular Expression Notation
.%R IEEE Std
.%N 1003.2
.%P section 2.8
.Re
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current
.St -p1003.2
spec says that
.Ql \&)
is an ordinary character in
the absence of an unmatched
.Ql \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Ql a\e(\e(b\e)*\e2\e)*d
match
.Ql abbbd ? ) .
Avoid using them.
.Pp
.St -p1003.2
specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is current consensus among implementors as to the right interpretation.
.Pp
The syntax for word boundaries is incredibly ugly.