.\" $NetBSD: re_format.7,v 1.16 2022/12/04 16:52:48 uwe Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"    The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"      This product includes software developed by the University of
.\"      California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/re_format.7 314373 2017-02-28 05:14:42Z glebius $
.\"
.Dd February 22, 2021
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX 1003.2 regular expressions
.Sh DESCRIPTION
Regular expressions
.Pq Dq RE Ns s ,
as defined in
.St -p1003.2 ,
come in two forms:
modern REs (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
REs)
and obsolete REs (roughly those of
.Xr ed 1 ;
1003.2
.Dq basic
REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
leaves some aspects of RE syntax and semantics open;
.ds DG \\s-2\\v'-0.4m'\\(dg\\v'0.4m'\\s+2
`\(dg' marks decisions on these aspects that
may not be fully portable to other
.St -p1003.2
implementations.
.Ss Extended regular expressions
A (modern) RE is one\*(DG or more non-empty\*(DG
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
A branch is one\*(DG or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed
by a single\*(DG
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
or
.Em bound .
An atom followed by
.Ql \&*
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Ql \&+
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Ql ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A
.Em bound
is
.Ql \&{
followed by an unsigned decimal integer,
possibly followed by
.Ql \&,
possibly followed by another unsigned decimal integer,
always followed by
.Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255\*(DG) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Em i
and no comma matches
a sequence of exactly
.Em i
matches of the atom.
An atom followed by a bound
containing one integer
.Em i
and a comma matches
a sequence of
.Em i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Em i
and
.Em j
matches
a sequence of
.Em i
through
.Em j
(inclusive) matches of the atom.
.Pp
An atom is a regular expression enclosed in
.Ql ()
(matching a match for the
regular expression),
an empty set of
.Ql ()
(matching the null string)\*(DG,
a
.Em bracket expression
(see below),
.Ql .\&
(matching any single character),
.Ql \&^
(matching the null string at the beginning of a line),
.Ql \&$
(matching the null string at the end of a line), a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Ql \e
followed by any other character\*(DG
(matching that character taken as an ordinary character,
as if the
.Ql \e
had not been present\*(DG),
or a single character with no other significance (matching that character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\*(DG.
It is illegal to end an RE with
.Ql \e .
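.Pp
The piece, bound, and atom rules above can be tried out through the
.Xr regex 3
interface.
The following C fragment is a minimal sketch rather than part of any library:
.Fn ere_full_match
is a hypothetical helper name, and the sketch assumes that
.Fn regexec
reports the leftmost-longest match, as POSIX requires.
```c
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper, not part of regex(3).
 * Returns 1 if the extended RE `pattern` matches all of `text`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int
ere_full_match(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	int rc;

	/* REG_EXTENDED selects modern (extended) RE syntax. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;

	/* Ask for the overall match only (one regmatch_t slot). */
	rc = regexec(&re, text, 1, &m, 0);
	regfree(&re);
	if (rc != 0)
		return 0;

	/* Assumes leftmost-longest matching: a whole-string match wins. */
	return m.rm_so == 0 && (size_t)m.rm_eo == strlen(text);
}
```
Under these assumptions,
.Ql ab{2,3}c
matches all of
.Ql abbc
but not
.Ql abbbbc ,
and
.Ql a{3,2}
fails to compile because the first integer of the bound exceeds the second.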
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Ql \&^ ,
it matches any single character
(but see below)
.Em not
from the rest of the list.
If two characters in the list are separated by
.Ql \&- ,
this is shorthand
for the full
.Em range
of characters between those two (inclusive) in the
collating sequence,
.No e.g. Ql [0-9]
in ASCII matches any decimal digit.
It is illegal\*(DG for two ranges to share an
endpoint,
.No e.g. Ql a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Ql \&]
in the list, make it the first character
(following a possible
.Ql \&^ ) .
To include a literal
.Ql \&- ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Ql \&-
as the first endpoint of a range,
enclose it in
.Ql [.\&
and
.Ql .]\&
to make it a collating element (see below).
With the exception of these and some combinations using
.Ql \&[
(see next paragraphs), all other special characters, including
.Ql \e ,
lose their special significance within a bracket expression.
.Pp
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Ql [.\&
and
.Ql .]\&
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
then the RE
.Ql [[.ch.]]*c
matches the first five characters
of
.Ql chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Ql [=
and
.Ql =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Ql [.\&
and
.Ql .] . )
For example, if
.Ql x
and
.Ql y
are the members of an equivalence class,
then
.Ql [[=x=]] ,
.Ql [[=y=]] ,
and
.Ql [xy]
are all synonymous.
An equivalence class may not\*(DG be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed in
.Ql [:
and
.Ql :]
stands for the list of all characters belonging to that
class.
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em "alnum digit punct"
.It Em "alpha graph space"
.It Em "blank lower upper"
.It Em "cntrl print xdigit"
.El
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracket expression like
.Ql [[:class:]]
can be used to match a single character that belongs to a character
class.
For the reverse, matching any character that does not belong to a specific
class, the negation operator of bracket expressions may be used:
.Ql [^[:class:]] .
.Pp
There are two special cases\*(DG of bracket expressions:
the bracket expressions
.Ql [[:<:]]
and
.Ql [[:>:]]
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
The additional word delimiters
.Ql \e<
and
.Ql \e>
are provided to ease compatibility with traditional
SVR4
systems but are not portable and should be avoided.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Ql bb*
matches the three middle characters of
.Ql abbbc ,
.Ql (wee|week)(knights|nights)
matches all ten characters of
.Ql weeknights ,
when
.Ql (.*).*\&
is matched against
.Ql abc
the parenthesized subexpression
matches all three characters, and
when
.Ql (a*)*
is matched against
.Ql bc
both the whole RE and the parenthesized
subexpression match the null string.
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
.No e.g. Ql x
becomes
.Ql [xX] .
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
.Ql [x]
becomes
.Ql [xX]
and
.Ql [^x]
becomes
.Ql [^xX] .
.Pp
No particular limit is imposed on the length of REs\*(DG.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Ss Basic regular expressions
Obsolete
.Pq Dq basic
regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
.Ql \&+
and
.Ql ?\&
are ordinary characters, and their functionality
can be expressed using bounds
.Po
.Ql {1,}
or
.Ql {0,1}
respectively
.Pc .
Also note that
.Ql x+
in modern REs is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
and
.Ql \e} ,
with
.Ql \&{
and
.Ql \&}
by themselves ordinary characters.
The parentheses for nested subexpressions are
.Ql \e(
and
.Ql \e) ,
with
.Ql \&(
and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
is an ordinary character except at the beginning of the
RE or\*(DG the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
RE or\*(DG the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
.Em back reference :
.Ql \e
followed by a non-zero decimal digit
.Em d
matches the same sequence of characters
matched by the
.Em d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.)
.Ql \e([bc]\e)\e1
matches
.Ql bb
or
.Ql cc
but not
.Ql bc .
.Sh SEE ALSO
.Xr regex 3
.Rs
.%T Regular Expression Notation
.%R IEEE Std
.%N 1003.2
.%P section 2.8
.Re
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current
.St -p1003.2
spec says that
.Ql \&)
is an ordinary character in
the absence of an unmatched
.Ql \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Ql a\e(\e(b\e)*\e2\e)*d
match
.Ql abbbd ? ) .
Avoid using them.
.Pp
The
.St -p1003.2
specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is current consensus among implementors as to the right interpretation.
.Pp
The syntax for word boundaries is incredibly ugly.