1*0a6a1f1dSLionel Sambuc.\" $NetBSD: re_format.7,v 1.11 2015/08/22 14:04:54 wiz Exp $ 22fe8fb19SBen Gras.\" 3b7061124SArun Thomas.\" Copyright (c) 1992, 1993, 1994 4b7061124SArun Thomas.\" The Regents of the University of California. All rights reserved. 5b7061124SArun Thomas.\" 6b7061124SArun Thomas.\" This code is derived from software contributed to Berkeley by 7b7061124SArun Thomas.\" Henry Spencer. 8b7061124SArun Thomas.\" 9b7061124SArun Thomas.\" Redistribution and use in source and binary forms, with or without 10b7061124SArun Thomas.\" modification, are permitted provided that the following conditions 11b7061124SArun Thomas.\" are met: 12b7061124SArun Thomas.\" 1. Redistributions of source code must retain the above copyright 13b7061124SArun Thomas.\" notice, this list of conditions and the following disclaimer. 14b7061124SArun Thomas.\" 2. Redistributions in binary form must reproduce the above copyright 15b7061124SArun Thomas.\" notice, this list of conditions and the following disclaimer in the 16b7061124SArun Thomas.\" documentation and/or other materials provided with the distribution. 172fe8fb19SBen Gras.\" 3. Neither the name of the University nor the names of its contributors 182fe8fb19SBen Gras.\" may be used to endorse or promote products derived from this software 192fe8fb19SBen Gras.\" without specific prior written permission. 202fe8fb19SBen Gras.\" 212fe8fb19SBen Gras.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 222fe8fb19SBen Gras.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 232fe8fb19SBen Gras.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 242fe8fb19SBen Gras.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 252fe8fb19SBen Gras.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 262fe8fb19SBen Gras.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 272fe8fb19SBen Gras.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 282fe8fb19SBen Gras.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 292fe8fb19SBen Gras.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 302fe8fb19SBen Gras.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 312fe8fb19SBen Gras.\" SUCH DAMAGE. 322fe8fb19SBen Gras.\" 332fe8fb19SBen Gras.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 342fe8fb19SBen Gras.\" 352fe8fb19SBen Gras.\" This code is derived from software contributed to Berkeley by 362fe8fb19SBen Gras.\" Henry Spencer. 372fe8fb19SBen Gras.\" 382fe8fb19SBen Gras.\" Redistribution and use in source and binary forms, with or without 392fe8fb19SBen Gras.\" modification, are permitted provided that the following conditions 402fe8fb19SBen Gras.\" are met: 412fe8fb19SBen Gras.\" 1. Redistributions of source code must retain the above copyright 422fe8fb19SBen Gras.\" notice, this list of conditions and the following disclaimer. 432fe8fb19SBen Gras.\" 2. Redistributions in binary form must reproduce the above copyright 442fe8fb19SBen Gras.\" notice, this list of conditions and the following disclaimer in the 452fe8fb19SBen Gras.\" documentation and/or other materials provided with the distribution. 46b7061124SArun Thomas.\" 3. All advertising materials mentioning features or use of this software 47b7061124SArun Thomas.\" must display the following acknowledgement: 48b7061124SArun Thomas.\" This product includes software developed by the University of 49b7061124SArun Thomas.\" California, Berkeley and its contributors. 50b7061124SArun Thomas.\" 4. Neither the name of the University nor the names of its contributors 51b7061124SArun Thomas.\" may be used to endorse or promote products derived from this software 52b7061124SArun Thomas.\" without specific prior written permission. 53b7061124SArun Thomas.\" 54b7061124SArun Thomas.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 55b7061124SArun Thomas.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 56b7061124SArun Thomas.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 57b7061124SArun Thomas.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 58b7061124SArun Thomas.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 59b7061124SArun Thomas.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 60b7061124SArun Thomas.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 61b7061124SArun Thomas.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 62b7061124SArun Thomas.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 63b7061124SArun Thomas.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 64b7061124SArun Thomas.\" SUCH DAMAGE. 65b7061124SArun Thomas.\" 66b7061124SArun Thomas.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94 67b7061124SArun Thomas.\" 682fe8fb19SBen Gras.Dd March 20, 1994 692fe8fb19SBen Gras.Dt RE_FORMAT 7 702fe8fb19SBen Gras.Os 712fe8fb19SBen Gras.Sh NAME 722fe8fb19SBen Gras.Nm re_format 732fe8fb19SBen Gras.Nd POSIX 1003.2 regular expressions 742fe8fb19SBen Gras.Sh DESCRIPTION 75b7061124SArun ThomasRegular expressions (``RE''s), 76b7061124SArun Thomasas defined in POSIX 1003.2, come in two forms: 77b7061124SArun Thomasmodern REs (roughly those of 782fe8fb19SBen Gras.Xr egrep 1 ; 79b7061124SArun Thomas1003.2 calls these ``extended'' REs) 80b7061124SArun Thomasand obsolete REs (roughly those of 812fe8fb19SBen Gras.Xr ed 1 ; 82b7061124SArun Thomas1003.2 ``basic'' REs). 83b7061124SArun ThomasObsolete REs mostly exist for backward compatibility in some old programs; 84b7061124SArun Thomasthey will be discussed at the end. 85b7061124SArun Thomas1003.2 leaves some aspects of RE syntax and semantics open; 86*0a6a1f1dSLionel Sambuc`(*)' marks decisions on these aspects that 87b7061124SArun Thomasmay not be fully portable to other 1003.2 implementations. 882fe8fb19SBen Gras.Pp 89*0a6a1f1dSLionel SambucA (modern) RE is one(*) or more non-empty(*) 902fe8fb19SBen Gras.Em branches , 91b7061124SArun Thomasseparated by `|'. 92b7061124SArun ThomasIt matches anything that matches one of the branches. 932fe8fb19SBen Gras.Pp 94*0a6a1f1dSLionel SambucA branch is one(*) or more 952fe8fb19SBen Gras.Em pieces , 962fe8fb19SBen Grasconcatenated. 97b7061124SArun ThomasIt matches a match for the first, followed by a match for the second, etc. 982fe8fb19SBen Gras.Pp 992fe8fb19SBen GrasA piece is an 1002fe8fb19SBen Gras.Em atom 1012fe8fb19SBen Graspossibly followed 102*0a6a1f1dSLionel Sambucby a single(*) `*', `+', `?', or 1032fe8fb19SBen Gras.Em bound . 104b7061124SArun ThomasAn atom followed by `*' matches a sequence of 0 or more matches of the atom. 105b7061124SArun ThomasAn atom followed by `+' matches a sequence of 1 or more matches of the atom. 106b7061124SArun ThomasAn atom followed by `?' matches a sequence of 0 or 1 matches of the atom. 1072fe8fb19SBen Gras.Pp 1082fe8fb19SBen GrasA 1092fe8fb19SBen Gras.Em bound 1102fe8fb19SBen Grasis `{' followed by an unsigned decimal integer, possibly followed by `,' 111b7061124SArun Thomaspossibly followed by another unsigned decimal integer, 112b7061124SArun Thomasalways followed by `}'. 113*0a6a1f1dSLionel SambucThe integers must lie between 0 and RE_DUP_MAX (255(*)) inclusive, 114b7061124SArun Thomasand if there are two of them, the first may not exceed the second. 1152fe8fb19SBen GrasAn atom followed by a bound containing one integer 1162fe8fb19SBen Gras.Em i 1172fe8fb19SBen Grasand no comma matches a sequence of exactly 1182fe8fb19SBen Gras.Em i 1192fe8fb19SBen Grasmatches of the atom. 1202fe8fb19SBen GrasAn atom followed by a bound containing one integer 1212fe8fb19SBen Gras.Em i 1222fe8fb19SBen Grasand a comma matches a sequence of 1232fe8fb19SBen Gras.Em i 1242fe8fb19SBen Grasor more matches of the atom. 1252fe8fb19SBen GrasAn atom followed by a bound containing two integers 1262fe8fb19SBen Gras.Em i 1272fe8fb19SBen Grasand 1282fe8fb19SBen Gras.Em j 1292fe8fb19SBen Grasmatches a sequence of 1302fe8fb19SBen Gras.Em i 1312fe8fb19SBen Grasthrough 1322fe8fb19SBen Gras.Em j 1332fe8fb19SBen Gras(inclusive) matches of the atom. 1342fe8fb19SBen Gras.Pp 135b7061124SArun ThomasAn atom is a regular expression enclosed in `()' (matching a match for the 136*0a6a1f1dSLionel Sambucregular expression), an empty set of `()' (matching the null string)(*), a 1372fe8fb19SBen Gras.Em bracket expression 1382fe8fb19SBen Gras(see below), `.' (matching any single character), 1392fe8fb19SBen Gras`^' (matching the null string at the beginning of a line), 1402fe8fb19SBen Gras`$' (matching the null string at the end of a line), 1412fe8fb19SBen Grasa `\e' followed by one of the characters `^.[$()|*+?{\e' 142b7061124SArun Thomas(matching that character taken as an ordinary character), 143*0a6a1f1dSLionel Sambuca `\e' followed by any other character(*) 144b7061124SArun Thomas(matching that character taken as an ordinary character, 145*0a6a1f1dSLionel Sambucas if the `\e' had not been present(*)), 146b7061124SArun Thomasor a single character with no other significance (matching that character). 147b7061124SArun ThomasA `{' followed by a character other than a digit is an ordinary 148*0a6a1f1dSLionel Sambuccharacter, not the beginning of a bound(*). 149b7061124SArun ThomasIt is illegal to end an RE with `\e'. 1502fe8fb19SBen Gras.Pp 1512fe8fb19SBen GrasA 1522fe8fb19SBen Gras.Em bracket expression 1532fe8fb19SBen Grasis a list of characters enclosed in `[]'. 154b7061124SArun ThomasIt normally matches any single character from the list (but see below). 155b7061124SArun ThomasIf the list begins with `^', 1562fe8fb19SBen Grasit matches any single character (but see below) 1572fe8fb19SBen Gras.Em not 1582fe8fb19SBen Grasfrom the rest of the list. 159b7061124SArun ThomasIf two characters in the list are separated by `\-', this is shorthand 1602fe8fb19SBen Grasfor the full 1612fe8fb19SBen Gras.Em range 1622fe8fb19SBen Grasof characters between those two (inclusive) in the collating sequence, 163b7061124SArun Thomase.g. `[0-9]' in ASCII matches any decimal digit. 164*0a6a1f1dSLionel SambucIt is illegal(*) for two ranges to share an endpoint, e.g. `a-c-e'. 165b7061124SArun ThomasRanges are very collating-sequence-dependent, 166b7061124SArun Thomasand portable programs should avoid relying on them. 1672fe8fb19SBen Gras.Pp 168b7061124SArun ThomasTo include a literal `]' in the list, make it the first character 169b7061124SArun Thomas(following a possible `^'). 170b7061124SArun ThomasTo include a literal `\-', make it the first or last character, 171b7061124SArun Thomasor the second endpoint of a range. 172b7061124SArun ThomasTo use a literal `\-' as the first endpoint of a range, 173b7061124SArun Thomasenclose it in `[.' and `.]' to make it a collating element (see below). 174b7061124SArun ThomasWith the exception of these and some combinations using `[' (see next 175b7061124SArun Thomasparagraphs), all other special characters, including `\e', lose their 176b7061124SArun Thomasspecial significance within a bracket expression. 1772fe8fb19SBen Gras.Pp 178b7061124SArun ThomasWithin a bracket expression, a collating element (a character, 179b7061124SArun Thomasa multi-character sequence that collates as if it were a single character, 180b7061124SArun Thomasor a collating-sequence name for either) 181b7061124SArun Thomasenclosed in `[.' and `.]' stands for the 182b7061124SArun Thomassequence of characters of that collating element. 183b7061124SArun ThomasThe sequence is a single element of the bracket expression's list. 184b7061124SArun ThomasA bracket expression containing a multi-character collating element 185b7061124SArun Thomascan thus match more than one character, 186b7061124SArun Thomase.g. if the collating sequence includes a `ch' collating element, 187b7061124SArun Thomasthen the RE `[[.ch.]]*c' matches the first five characters 188b7061124SArun Thomasof `chchcc'. 1892fe8fb19SBen Gras.Pp 190b7061124SArun ThomasWithin a bracket expression, a collating element enclosed in `[=' and 191b7061124SArun Thomas`=]' is an equivalence class, standing for the sequences of characters 192b7061124SArun Thomasof all collating elements equivalent to that one, including itself. 193b7061124SArun Thomas(If there are no other equivalent collating elements, 194b7061124SArun Thomasthe treatment is as if the enclosing delimiters were `[.' and `.]'.) 1952fe8fb19SBen GrasFor example, if o and '\(^o' are the members of an equivalence class, 1962fe8fb19SBen Grasthen `[[=o=]]', `[[=\(^o'=]]', and `[o\(^o']' are all synonymous. 197*0a6a1f1dSLionel SambucAn equivalence class may not(*) be an endpoint 198b7061124SArun Thomasof a range. 1992fe8fb19SBen Gras.Pp 2002fe8fb19SBen GrasWithin a bracket expression, the name of a 2012fe8fb19SBen Gras.Em character class 2022fe8fb19SBen Grasenclosed in `[:' and `:]' stands for the list of all characters 2032fe8fb19SBen Grasbelonging to that class. 204b7061124SArun ThomasStandard character class names are: 2052fe8fb19SBen Gras.Bl -column "alnum" "digit" "xdigit" 2062fe8fb19SBen Gras.It alnum digit punct 2072fe8fb19SBen Gras.It alpha graph space 2082fe8fb19SBen Gras.It blank lower upper 2092fe8fb19SBen Gras.It cntrl print xdigit 2102fe8fb19SBen Gras.El 2112fe8fb19SBen Gras.Pp 212b7061124SArun ThomasThese stand for the character classes defined in 2132fe8fb19SBen Gras.Xr ctype 3 . 214b7061124SArun ThomasA locale may provide others. 215b7061124SArun ThomasA character class may not be used as an endpoint of a range. 2162fe8fb19SBen Gras.Pp 217*0a6a1f1dSLionel SambucThere are two special cases(*) of bracket expressions: 2182fe8fb19SBen Grasthe bracket expressions `[[:\*[Lt]:]]' and `[[:\*[Gt]:]]' match 2192fe8fb19SBen Grasthe null string at the beginning and end of a word respectively. 2202fe8fb19SBen GrasA word is defined as a sequence of word characters 2212fe8fb19SBen Graswhich is neither preceded nor followed by word characters. 222b7061124SArun ThomasA word character is an 2232fe8fb19SBen Gras.Em alnum 224b7061124SArun Thomascharacter (as defined by 2252fe8fb19SBen Gras.Xr ctype 3 ) 226b7061124SArun Thomasor an underscore. 2272fe8fb19SBen GrasThis is an extension, compatible with but not specified by POSIX 1003.2, 2282fe8fb19SBen Grasand should be used with caution in software intended to be portable 2292fe8fb19SBen Grasto other systems. 2302fe8fb19SBen Gras.Pp 231b7061124SArun ThomasIn the event that an RE could match more than one substring of a given 2322fe8fb19SBen Grasstring, the RE matches the one starting earliest in the string. 233b7061124SArun ThomasIf the RE could match more than one substring starting at that point, 234b7061124SArun Thomasit matches the longest. 235b7061124SArun ThomasSubexpressions also match the longest possible substrings, subject to 236b7061124SArun Thomasthe constraint that the whole match be as long as possible, 237b7061124SArun Thomaswith subexpressions starting earlier in the RE taking priority over 238b7061124SArun Thomasones starting later. 239b7061124SArun ThomasNote that higher-level subexpressions thus take priority over 240b7061124SArun Thomastheir lower-level component subexpressions. 2412fe8fb19SBen Gras.Pp 242b7061124SArun ThomasMatch lengths are measured in characters, not collating elements. 243b7061124SArun ThomasA null string is considered longer than no match at all. 244b7061124SArun ThomasFor example, 245b7061124SArun Thomas`bb*' matches the three middle characters of `abbbc', 246b7061124SArun Thomas`(wee|week)(knights|nights)' matches all ten characters of `weeknights', 247b7061124SArun Thomaswhen `(.*).*' is matched against `abc' the parenthesized subexpression 248b7061124SArun Thomasmatches all three characters, and 249b7061124SArun Thomaswhen `(a*)*' is matched against `bc' both the whole RE and the parenthesized 250b7061124SArun Thomassubexpression match the null string. 2512fe8fb19SBen Gras.Pp 252b7061124SArun ThomasIf case-independent matching is specified, 253b7061124SArun Thomasthe effect is much as if all case distinctions had vanished from the 254b7061124SArun Thomasalphabet. 255b7061124SArun ThomasWhen an alphabetic that exists in multiple cases appears as an 256b7061124SArun Thomasordinary character outside a bracket expression, it is effectively 257b7061124SArun Thomastransformed into a bracket expression containing both cases, 258b7061124SArun Thomase.g. `x' becomes `[xX]'. 259b7061124SArun ThomasWhen it appears inside a bracket expression, all case counterparts 260b7061124SArun Thomasof it are added to the bracket expression, so that (e.g.) `[x]' 261b7061124SArun Thomasbecomes `[xX]' and `[^x]' becomes `[^xX]'. 2622fe8fb19SBen Gras.Pp 263*0a6a1f1dSLionel SambucNo particular limit is imposed on the length of REs(*). 264b7061124SArun ThomasPrograms intended to be portable should not employ REs longer 265b7061124SArun Thomasthan 256 bytes, 266b7061124SArun Thomasas an implementation can refuse to accept such REs and remain 267b7061124SArun ThomasPOSIX-compliant. 2682fe8fb19SBen Gras.Pp 269b7061124SArun ThomasObsolete (``basic'') regular expressions differ in several respects. 270b7061124SArun Thomas`|', `+', and `?' are ordinary characters and there is no equivalent 271b7061124SArun Thomasfor their functionality. 272b7061124SArun ThomasThe delimiters for bounds are `\e{' and `\e}', 273b7061124SArun Thomaswith `{' and `}' by themselves ordinary characters. 274b7061124SArun ThomasThe parentheses for nested subexpressions are `\e(' and `\e)', 275b7061124SArun Thomaswith `(' and `)' by themselves ordinary characters. 276b7061124SArun Thomas`^' is an ordinary character except at the beginning of the 277*0a6a1f1dSLionel SambucRE or(*) the beginning of a parenthesized subexpression, 278b7061124SArun Thomas`$' is an ordinary character except at the end of the 279*0a6a1f1dSLionel SambucRE or(*) the end of a parenthesized subexpression, 280b7061124SArun Thomasand `*' is an ordinary character if it appears at the beginning of the 281b7061124SArun ThomasRE or the beginning of a parenthesized subexpression 282b7061124SArun Thomas(after a possible leading `^'). 2832fe8fb19SBen GrasFinally, there is one new type of atom, a 2842fe8fb19SBen Gras.Em back reference : 2852fe8fb19SBen Gras`\e' followed by a non-zero decimal digit 2862fe8fb19SBen Gras.Em d 287b7061124SArun Thomasmatches the same sequence of characters 2882fe8fb19SBen Grasmatched by the 2892fe8fb19SBen Gras.Em d Ns th parenthesized subexpression 290b7061124SArun Thomas(numbering subexpressions by the positions of their opening parentheses, 291b7061124SArun Thomasleft to right), 292b7061124SArun Thomasso that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'. 2932fe8fb19SBen Gras.Sh SEE ALSO 2942fe8fb19SBen Gras.Xr regex 3 2952fe8fb19SBen Gras.Pp 296b7061124SArun ThomasPOSIX 1003.2, section 2.8 (Regular Expression Notation). 2972fe8fb19SBen Gras.Sh BUGS 298b7061124SArun ThomasHaving two kinds of REs is a botch. 2992fe8fb19SBen Gras.Pp 300b7061124SArun ThomasThe current 1003.2 spec says that `)' is an ordinary character in 301b7061124SArun Thomasthe absence of an unmatched `('; 3022fe8fb19SBen Grasthis was an unintentional result of a wording error, and change is likely. 303b7061124SArun ThomasAvoid relying on it. 3042fe8fb19SBen Gras.Pp 305b7061124SArun ThomasBack references are a dreadful botch, 306b7061124SArun Thomasposing major problems for efficient implementations. 307b7061124SArun ThomasThey are also somewhat vaguely defined 3082fe8fb19SBen Gras(does `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?). 309b7061124SArun ThomasAvoid using them. 3102fe8fb19SBen Gras.Pp 311b7061124SArun Thomas1003.2's specification of case-independent matching is vague. 312b7061124SArun ThomasThe ``one case implies all cases'' definition given above 313b7061124SArun Thomasis current consensus among implementors as to the right interpretation. 3142fe8fb19SBen Gras.Pp 315b7061124SArun ThomasThe syntax for word boundaries is incredibly ugly. 316