.\" $OpenBSD: re_format.7,v 1.18 2014/09/10 15:10:19 jmc Exp $
.\"
.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
.\"
.Dd $Mdocdate: September 10 2014 $
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX regular expressions
.Sh DESCRIPTION
Regular expressions (REs),
as defined in
.St -p1003.1-2004 ,
come in two forms:
basic regular expressions
(BREs)
and extended regular expressions
(EREs).
Both forms of regular expressions are supported
by the interfaces described in
.Xr regex 3 .
Applications dealing with regular expressions
may use one or the other form
(or indeed both).
For example,
.Xr ed 1
uses BREs,
whilst
.Xr egrep 1
talks EREs.
Consult the manual page for the specific application to find out which
it uses.
.Pp
POSIX leaves some aspects of RE syntax and semantics open;
.Sq **
marks decisions on these aspects that
may not be fully portable to other POSIX implementations.
.Pp
This manual page first describes regular expressions in general,
specifically extended regular expressions,
and then discusses differences between them and basic regular expressions.
.Sh EXTENDED REGULAR EXPRESSIONS
An ERE is one** or more non-empty**
.Em branches ,
separated by
.Sq \*(Ba .
It matches anything that matches one of the branches.
.Pp
A branch is one** or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed by a single**
.Sq * ,
.Sq + ,
.Sq ?\& ,
or
.Em bound .
An atom followed by
.Sq *
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Sq +
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Sq ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A bound is
.Sq {
followed by an unsigned decimal integer,
possibly followed by
.Sq ,\&
possibly followed by another unsigned decimal integer,
always followed by
.Sq } .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255**) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Ar i
and no comma matches
a sequence of exactly
.Ar i
matches of the atom.
An atom followed by a bound
containing one integer
.Ar i
and a comma matches
a sequence of
.Ar i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Ar i
and
.Ar j
matches a sequence of
.Ar i
through
.Ar j
(inclusive) matches of the atom.
.Pp
An atom is a regular expression enclosed in
.Sq ()
(matching a part of the regular expression),
an empty set of
.Sq ()
(matching the null string)**,
a
.Em bracket expression
(see below),
.Sq .\&
(matching any single character),
.Sq ^
(matching the null string at the beginning of a line),
.Sq $
(matching the null string at the end of a line),
a
.Sq \e
followed by one of the characters
.Sq ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Sq \e
followed by any other character**
(matching that character taken as an ordinary character,
as if the
.Sq \e
had not been present**),
or a single character with no other significance (matching that character).
A
.Sq {
followed by a character other than a digit is an ordinary character,
not the beginning of a bound**.
It is illegal to end an RE with
.Sq \e .
.Pp
A bracket expression is a list of characters enclosed in
.Sq [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Sq ^ ,
it matches any single character
.Em not
from the rest of the list
(but see below).
If two characters in the list are separated by
.Sq - ,
this is shorthand for the full
.Em range
of characters between those two (inclusive) in the
collating sequence, e.g.\&
.Sq [0-9]
in ASCII matches any decimal digit.
It is illegal** for two ranges to share an endpoint, e.g.\&
.Sq a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Sq ]\&
in the list, make it the first character
(following a possible
.Sq ^ ) .
To include a literal
.Sq - ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Sq -
as the first endpoint of a range,
enclose it in
.Sq [.
and
.Sq .]
to make it a collating element (see below).
With the exception of these and some combinations using
.Sq \&[
(see next paragraphs),
all other special characters, including
.Sq \e ,
lose their special significance within a bracket expression.
.Pp
Within a bracket expression, a collating element
(a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Sq [.
and
.Sq .]
stands for the sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g. if the collating sequence includes a
.Sq ch
collating element,
then the RE
.Sq [[.ch.]]*c
matches the first five characters of
.Sq chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Sq [=
and
.Sq =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Sq [.
and
.Sq .] . )
For example, if
.Sq x
and
.Sq y
are the members of an equivalence class,
then
.Sq [[=x=]] ,
.Sq [[=y=]] ,
and
.Sq [xy]
are all synonymous.
An equivalence class may not** be an endpoint of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed
in
.Sq [:
and
.Sq :]
stands for the list of all characters belonging to that class.
Standard character class names are:
.Bd -literal -offset indent
alnum digit punct
alpha graph space
blank lower upper
cntrl print xdigit
.Ed
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
There are two special cases** of bracket expressions:
the bracket expressions
.Sq [[:<:]]
and
.Sq [[:>:]]
match the null string at the beginning and end of a word, respectively.
A word is defined as a sequence of
characters starting and ending with a word character
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by POSIX,
and should be used with
caution in software intended to be portable to other systems.
The additional word delimiters
.Ql \e<
and
.Ql \e>
are provided to ease compatibility with traditional SVR4
systems but are not portable and should be avoided.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Sq bb*
matches the three middle characters of
.Sq abbbc ;
.Sq (wee|week)(knights|nights)
matches all ten characters of
.Sq weeknights ;
when
.Sq (.*).*
is matched against
.Sq abc ,
the parenthesized subexpression matches all three characters;
and when
.Sq (a*)*
is matched against
.Sq bc ,
both the whole RE and the parenthesized subexpression match the null string.
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g.\&
.Sq x
becomes
.Sq [xX] .
When it appears inside a bracket expression,
all case counterparts of it are added to the bracket expression,
so that, for example,
.Sq [x]
becomes
.Sq [xX]
and
.Sq [^x]
becomes
.Sq [^xX] .
.Pp
No particular limit is imposed on the length of REs**.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Pp
The following is a list of extended regular expressions:
.Bl -tag -width Ds
.It Ar c
Any character
.Ar c
not listed below matches itself.
.It \e Ns Ar c
Any backslash-escaped character
.Ar c
matches itself.
.It \&.
Matches any single character that is not a newline
.Pq Sq \en .
.It Bq Ar char-class
Matches any single character in
.Ar char-class .
To include a
.Ql \&]
in
.Ar char-class ,
it must be the first character.
A range of characters may be specified by separating the end characters
of the range with a
.Ql - ;
e.g.\&
.Ar a-z
specifies the lower case characters.
The following literal expressions can also be used in
.Ar char-class
to specify sets of characters:
.Bd -unfilled -offset indent
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
.Ed
.Pp
If
.Ql -
appears as the first or last character of
.Ar char-class ,
then it matches itself.
All other characters in
.Ar char-class
match themselves.
.Pp
Patterns in
.Ar char-class
of the form
.Eo [.
.Ar col-elm
.Ec .]\&
or
.Eo [=
.Ar col-elm
.Ec =]\& ,
where
.Ar col-elm
is a collating element, are interpreted according to
.Xr setlocale 3
.Pq not currently supported .
.It Bq ^ Ns Ar char-class
Matches any single character, other than newline, not in
.Ar char-class .
.Ar char-class
is defined as above.
.It ^
If
.Sq ^
is the first character of a regular expression, then it
anchors the regular expression to the beginning of a line.
Otherwise, it matches itself.
.It $
If
.Sq $
is the last character of a regular expression,
it anchors the regular expression to the end of a line.
Otherwise, it matches itself.
.It [[:<:]]
Anchors the single character regular expression or subexpression
immediately following it to the beginning of a word.
.It [[:>:]]
Anchors the single character regular expression or subexpression
immediately preceding it to the end of a word.
.It Pq Ar re
Defines a subexpression
.Ar re .
Any set of characters enclosed in parentheses
matches whatever the set of characters without parentheses matches
(that is a long-winded way of saying the constructs
.Sq (re)
and
.Sq re
match identically).
.It *
Matches the single character regular expression or subexpression
immediately preceding it zero or more times.
If
.Sq *
is the first character of a regular expression or subexpression,
then it matches itself.
The
.Sq *
operator sometimes yields unexpected results.
For example, the regular expression
.Ar b*
matches the beginning of the string
.Qq abbb
(as opposed to the substring
.Qq bbb ) ,
since a null match is the only leftmost match.
.It +
Matches the single character regular expression
or subexpression immediately preceding it
one or more times.
.It ?
Matches the single character regular expression
or subexpression immediately preceding it
0 or 1 times.
.Sm off
.It Xo
.Pf { Ar n , m No }\ \&
.Pf { Ar n , No }\ \&
.Pf { Ar n No }
.Xc
.Sm on
Matches the single character regular expression or subexpression
immediately preceding it at least
.Ar n
and at most
.Ar m
times.
If
.Ar m
is omitted, then it matches at least
.Ar n
times.
If the comma is also omitted, then it matches exactly
.Ar n
times.
.It \*(Ba
Used to separate patterns.
For example,
the pattern
.Sq cat\*(Badog
matches either
.Sq cat
or
.Sq dog .
.El
.Sh BASIC REGULAR EXPRESSIONS
Basic regular expressions differ in several respects:
.Bl -bullet -offset 3n
.It
.Sq \*(Ba ,
.Sq + ,
and
.Sq ?\&
are ordinary characters and there is no equivalent
for their functionality.
.It
The delimiters for bounds are
.Sq \e{
and
.Sq \e} ,
with
.Sq {
and
.Sq }
by themselves ordinary characters.
.It
The parentheses for nested subexpressions are
.Sq \e(
and
.Sq \e) ,
with
.Sq \&(
and
.Sq )\&
by themselves ordinary characters.
.It
.Sq ^
is an ordinary character except at the beginning of the
RE or** the beginning of a parenthesized subexpression.
.It
.Sq $
is an ordinary character except at the end of the
RE or** the end of a parenthesized subexpression.
.It
.Sq *
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Sq ^ ) .
.It
Finally, there is one new type of atom, a
.Em back-reference :
.Sq \e
followed by a non-zero decimal digit
.Ar d
matches the same sequence of characters matched by the
.Ar d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that, for example,
.Sq \e([bc]\e)\e1
matches
.Sq bb\&
or
.Sq cc
but not
.Sq bc .
.El
.Pp
The following is a list of basic regular expressions:
.Bl -tag -width Ds
.It Ar c
Any character
.Ar c
not listed below matches itself.
.It \e Ns Ar c
Any backslash-escaped character
.Ar c ,
except for
.Sq { ,
.Sq } ,
.Sq \&( ,
and
.Sq \&) ,
matches itself.
.It \&.
Matches any single character that is not a newline
.Pq Sq \en .
.It Bq Ar char-class
Matches any single character in
.Ar char-class .
To include a
.Ql \&]
in
.Ar char-class ,
it must be the first character.
A range of characters may be specified by separating the end characters
of the range with a
.Ql - ;
e.g.\&
.Ar a-z
specifies the lower case characters.
The following literal expressions can also be used in
.Ar char-class
to specify sets of characters:
.Bd -unfilled -offset indent
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
.Ed
.Pp
If
.Ql -
appears as the first or last character of
.Ar char-class ,
then it matches itself.
All other characters in
.Ar char-class
match themselves.
.Pp
Patterns in
.Ar char-class
of the form
.Eo [.
.Ar col-elm
.Ec .]\&
or
.Eo [=
.Ar col-elm
.Ec =]\& ,
where
.Ar col-elm
is a collating element, are interpreted according to
.Xr setlocale 3
.Pq not currently supported .
.It Bq ^ Ns Ar char-class
Matches any single character, other than newline, not in
.Ar char-class .
.Ar char-class
is defined as above.
.It ^
If
.Sq ^
is the first character of a regular expression, then it
anchors the regular expression to the beginning of a line.
Otherwise, it matches itself.
.It $
If
.Sq $
is the last character of a regular expression,
it anchors the regular expression to the end of a line.
Otherwise, it matches itself.
.It [[:<:]]
Anchors the single character regular expression or subexpression
immediately following it to the beginning of a word.
.It [[:>:]]
Anchors the single character regular expression or subexpression
immediately preceding it to the end of a word.
.It \e( Ns Ar re Ns \e)
Defines a subexpression
.Ar re .
Subexpressions may be nested.
A subsequent backreference of the form
.Pf \e Ns Ar n ,
where
.Ar n
is a number in the range [1,9], expands to the text matched by the
.Ar n Ns th
subexpression.
For example, the regular expression
.Ar \e(.*\e)\e1
matches any string consisting of identical adjacent substrings.
Subexpressions are ordered relative to their left delimiter.
.It *
Matches the single character regular expression or subexpression
immediately preceding it zero or more times.
If
.Sq *
is the first character of a regular expression or subexpression,
then it matches itself.
The
.Sq *
operator sometimes yields unexpected results.
For example, the regular expression
.Ar b*
matches the beginning of the string
.Qq abbb
(as opposed to the substring
.Qq bbb ) ,
since a null match is the only leftmost match.
.Sm off
.It Xo
.Pf \e{ Ar n , m No \e}\ \&
.Pf \e{ Ar n , No \e}\ \&
.Pf \e{ Ar n No \e}
.Xc
.Sm on
Matches the single character regular expression or subexpression
immediately preceding it at least
.Ar n
and at most
.Ar m
times.
If
.Ar m
is omitted, then it matches at least
.Ar n
times.
If the comma is also omitted, then it matches exactly
.Ar n
times.
.El
.Sh SEE ALSO
.Xr ctype 3 ,
.Xr regex 3
.Sh STANDARDS
.St -p1003.1-2004 :
Base Definitions, Chapter 9 (Regular Expressions).
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current POSIX spec says that
.Sq )\&
is an ordinary character in the absence of an unmatched
.Sq \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back-references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Sq a\e(\e(b\e)*\e2\e)*d
match
.Sq abbbd ? ) .
Avoid using them.
.Pp
POSIX's specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
.Pp
The syntax for word boundaries is incredibly ugly.
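.Sh EXAMPLES
The rule that an RE matches the earliest-starting substring,
and the longest one among those,
can be observed through the interfaces in
.Xr regex 3 .
The following is a minimal sketch under stated assumptions,
not a definitive implementation:
.Fn re_span
is a hypothetical helper introduced here purely for illustration.
It compiles its first argument as an ERE with
.Fn regcomp
and reports the byte offsets of the overall match found by
.Fn regexec .
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of regex(3)): compile pattern as an
 * ERE, match it against text, and store the byte offsets of the whole
 * match in *start and *end.
 * Returns 0 on a match, REG_NOMATCH if nothing matched, or the
 * regcomp() error code if the pattern did not compile.
 */
int
re_span(const char *pattern, const char *text, int *start, int *end)
{
	regex_t re;
	regmatch_t m[1];
	int rc;

	assert(pattern != NULL && text != NULL);
	assert(start != NULL && end != NULL);

	/* REG_EXTENDED selects ERE syntax; without it the RE is a BRE. */
	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0)
		return rc;

	/* m[0] receives the span of the entire match. */
	rc = regexec(&re, text, 1, m, 0);
	if (rc == 0) {
		*start = (int)m[0].rm_so;
		*end = (int)m[0].rm_eo;
	}
	regfree(&re);
	return rc;
}
```
.Ed
.Pp
With a conforming
.Xr regex 3 ,
calling the hypothetical
.Fn re_span
with the ERE
.Sq bb*
and the string
.Sq abbbc
should report offsets 1 and 4 (the three middle characters),
and with
.Sq (wee|week)(knights|nights)
and
.Sq weeknights
it should report offsets 0 and 10.
With
.Sq b*
and
.Sq abbb
it should report the empty match at offset 0,
since a null match is the only leftmost match.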