=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

Matching operations can have various modifiers.  Modifiers
that relate to the interpretation of the regular expression inside
are listed below.  Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale.  See L<perllocale>.

=item m

Treat string as multiple lines.  That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s

Treat string as single line.  That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting.  That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x

Extend your pattern's legibility by permitting whitespace and comments.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash.  Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct.  See below.

The C</x> modifier itself needs a little more explanation.  It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.
This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes.  Taken together,
these features go a long way towards making Perl's regular expressions
more readable.  Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early.  See the C-comment deletion code
in L<perlop>.

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines.  (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.)  See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line.  Embedded newlines
will not be matched by "^" or "$".  You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline.  At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator.  (Older programs did this by setting C<$*>,
but this practice is now deprecated.)

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.  The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.  In particular, the lower bound
is not optional.)  The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms.  The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match.  If you want it to match the
minimum number of times possible, follow the quantifier with a "?".  Note
that the meanings don't change, just the "greediness":

    *?      Match 0 or more times
    +?      Match 1 or more times
    ??      Match 0 or 1 time
    {n}?    Match exactly n times
    {n,}?   Match at least n times
    {n,m}?  Match at least n but not more than m times

Because patterns are processed as double quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale.  See L<perllocale>.  For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
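
As a rough illustration of the C<\Q> rule just described (a sketch only;
the sample strings and variable names are invented for this example), you
can either interpolate a variable inside C<\Q...\E>, or end the quoted run,
escape the C<@>, and start a new run:

```perl
use strict;
use warnings;

# Hypothetical sample data, invented for this sketch.
my $addr = 'user@host';
my $text = 'mail from user@host today';

# Variables are interpolated first, then \Q quotes the result,
# so any metacharacters inside $addr are matched literally.
my $via_variable = $text =~ /\Q$addr\E/ ? 1 : 0;

# The literal form: close the \Q run, escape the @, open a new run.
my $via_literal = $text =~ m/\Quser\E\@\Qhost/ ? 1 : 0;

# quotemeta() applies the same quoting as \Q to a plain string.
my $quoted = quotemeta 'a.b';

print "variable form: $via_variable\n";
print "literal form:  $via_literal\n";
print "quotemeta:     $quoted\n";
```

The variable form is usually the easier one to read, since the C<@> never
appears literally in the pattern.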

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-"word" character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property.  Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to (?:\PM\pM*)
    \C  Match a single C char (octet) even under Unicode.
        NOTE: breaks up characters into their UTF-8 bytes,
        so you may end up with malformed pieces of UTF-8.
        Unsupported in lookbehind.

A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
to match a string of Perl-identifier characters (which isn't the same
as matching an English word).  If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, the result is not a range: the "-" is understood
literally.
If Unicode is in effect, C<\s> matches also "\x{85}",
"\x{2028}", and "\x{2029}"; see L<perlunicode> for more details about
C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
You can define your own C<\p> and C<\P> properties, see L<perlunicode>.

The POSIX character class syntax

    [:class:]

is also available.  The available classes and their backslash
equivalents (if available) are as follows:

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

=over

=item [1]

A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.

=item [2]

Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
also the (very rare) `vertical tabulator', "\ck", chr(11).

=item [3]

A Perl extension, see above.

=back

For example use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class.  For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.

The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available), will hold:

    [:...:]     \p{...}         backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit         \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl     \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example C<[:lower:]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
`word' and `blank').

The classes whose names are perhaps not obvious are:

=over 4

=item cntrl

Any control character.
Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters.  All characters with ord() less than
32 are most often classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph

Any alphanumeric or punctuation (special) character.

=item print

Any alphanumeric or punctuation (special) character or the space character.

=item punct

Any punctuation (special) character.

=item xdigit

Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'.  This is a Perl extension.  For example:

    POSIX         traditional  Unicode

    [:^digit:]    \D           \P{IsDigit}
    [:^space:]    \S           \P{IsSpace}
    [:^word:]     \W           \P{IsWord}

Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class.  The POSIX character classes
[.cc.]
and [=cc=] are recognized but B<not> supported and trying to
use them will cause an error.

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>.  (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary.  To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference.  The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>.  Currently C<\G> is only fully
supported when anchored to the start of the pattern; while it
is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
such uses (C</.\G/g>, for example) currently cause problems, and
it is recommended that you avoid such usage for now.

The bracketing construct C<( ... )> creates capture buffers.  To
refer to the digit'th buffer use \<digit> within the
match.  Outside the match use "$" instead of "\".  (The
\<digit> notation works in certain circumstances outside
the match.  See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.

There is no limit to the number of captured substrings that you may
use.  However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc.  (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
a horizontal tab under ASCII.)  Perl resolves this
ambiguity by interpreting \10 as a backreference only if at least 10
left parentheses have opened before it.
Likewise \11 is a
backreference only if at least 11 left parentheses have opened
before it.  And so on.  \1 through \9 are always interpreted as
backreferences.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/(.)\1/) {                  # find first doubled char
        print "'$1' is the first doubled character\n";
    }

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous
match.  C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string.  (At one point C<$0> did
also, but now it returns the name of the program.)  C<$`> returns
everything before the matched string.  C<$'> returns everything
after the matched string.  And C<$^N> contains whatever was matched by
the most-recently closed group (submatch).  C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.

The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
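
The scoping rule above can be sketched as follows.  This is only an
illustration: the strings and variable names are made up, and the point
is simply that a successful match in a nested block does not clobber the
outer block's C<$1> once the nested block is left.

```perl
use strict;
use warnings;

my ($hours, $inner, $after_inner);

if ('Time: 12:34:56' =~ /Time: (..):(..):(..)/) {
    # Copy captures into lexicals right away; copies survive later matches.
    $hours = $1;

    {
        # A successful match in a nested block sets its own $1 ...
        if ('abc' =~ /(b)/) {
            $inner = $1;
        }
    }

    # ... and leaving that block restores the outer match's $1.
    $after_inner = $1;
}

print "hours=$hours inner=$inner after=$after_inner\n";
```

Copying C<$1> into a lexical as soon as the match succeeds, as done for
C<$hours> here, avoids having to reason about which match is currently
in effect.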

B<NOTE>: failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match.  This may substantially slow your program.  Perl
uses the same mechanism to produce $1, $2, etc, so you also pay a
price for each pattern that contains capturing parentheses.  (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.)  But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized.  So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.  As of 5.005, C<$&> is not so costly as the
other two.

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>.  Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric.  So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter.
This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern.  Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results.  If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>.  The syntax is a
pair of parentheses with a question mark as the first thing within
the parentheses.  The character after the question mark indicates
the extension.

The stability of these extensions varies widely.  Some have been
part of the core language for many years.  Others are experimental
and may change without warning or be completely removed.
Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology...

=over 10

=item C<(?#text)>

A comment.  The text is ignored.  If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?imsx-imsx)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).  This is
particularly useful for dynamic patterns, such as those read in from a
configuration file, read in as an argument, or specified in a table
somewhere.  Consider the case where some patterns want to be case
sensitive and some do not.  The case insensitive ones need to include
merely C<(?i)> at the front of the pattern.
For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group.  For example,

    ( (?i) blah ) \s+ \1

will match a repeated (I<including the case>!) word C<blah> in any
case, assuming C<x> modifier, and no C<i> modifier outside this
group.

=item C<(?:pattern)>

=item C<(?imsx-imsx:pattern)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does.  So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields.  It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flags modifiers as with
C<(?imsx-imsx)>.
For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

=item C<(?=pattern)>

A zero-width positive look-ahead assertion.  For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>

A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar".  Note
however that look-ahead and look-behind are NOT the same thing.  You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want.  That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match.  You would have to do something like C</(?!foo)...bar/> for that.   We
say "like" because there's the case of your "bar" not having three characters
before it.  You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For look-behind see below.

=item C<(?<=pattern)>

A zero-width positive look-behind assertion.
For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

=item C<(?<!pattern)>

A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar".  Works
only for fixed-width look-behind.

=item C<(?{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

This zero-width assertion evaluates any embedded Perl code.  It
always succeeds, and its C<code> is not interpolated.  Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses.  For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against.  You can also use C<pos()> to find
the current matching position within this string.
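
As a minimal sketch of the last point, the following records what
C<pos()> reports each time the code block runs.  The names C<$str> and
C<@seen> are hypothetical, and a package variable is used on the
assumption that older perls handle those more reliably than lexicals
inside a code block.

    use strict;
    use warnings;

    our @seen;              # hypothetical name, for illustration only
    my $str = "abc";

    # The code block runs once after each character matched by \w.
    $str =~ / (?: \w (?{ push @seen, pos() }) )+ /x;

    print "positions: @seen\n";

Nothing in this pattern forces the engine to backtrack, so each entry
is simply the offset just past the character that was matched.  Pushing
onto a global array is not undone by backtracking, so in a pattern that
does backtrack the list could also hold positions from abandoned
attempts.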

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                  # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;     # Update $cnt, backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })               # On success copy to non-localized
                                        # location.
     >x;

will set C<$res = 4>.  Note that after the match, C<$cnt> returns to the
globally introduced value, because the scopes that restrict C<local>
operators are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch.  If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>.  This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
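
The way successive code blocks can hand a value along through C<$^R>
might be sketched as follows.  The variable C<$total> is a hypothetical
name; the final block copies the value to a non-localized location, in
the same spirit as C<$res> above.

    use strict;
    use warnings;

    our $total;             # hypothetical name, for illustration only

    "ab" =~ /
        a (?{ 1 })              # result of this block: 1
        b (?{ $^R + 1 })        # reads the previous result, yields 2
        (?{ $total = $^R })     # copy the running result out
    /x;

    print "total = $total\n";

The comments inside the pattern deliberately avoid variable names,
since a pattern is subject to interpolation before C</x> comments are
discarded.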

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).

This restriction exists because of the widespread and remarkably
convenient custom of using run-time determined strings as patterns.
For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern.  If
you turn on C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
compartment.  See L<perlsec> for details about both these mechanisms.

=item C<(??{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
A simplified version of the syntax may be introduced for commonly
used idioms.

This is a "postponed" regular subexpression.
The C<code> is evaluated
at run time, at the moment this subexpression may match.  The result
of evaluation is considered as a regular expression and matched as
if it were inserted instead of this construct.

The C<code> is not interpolated.  As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )    # Non-parens without backtracking
                |
                  (??{ $re })     # Group with matching parens
               )*
               \)
            }x;

=item C<< (?>pattern) >>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>.  This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L<"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.

For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<a> at the beginning of string, leaving no C<a> for
C<ab> to match.  In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L<"Backtracking">).  In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.

An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\1>.  This matches the same substring as a standalone
C<pattern>, and the following C<\1> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)

Consider this pattern:

    m{ \(
          (
            [^()]+              # x+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching parentheses
two levels deep or less.  However, if there is no such group, it
will take virtually forever on a long string.
That's because there
are so many different ways to split a long string into several
substrings.  This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern.  Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time.  This
exponential performance will make it appear that your program has
hung.  However, a tiny change to this pattern

    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
of the time when used on a similar string with 1000000 C<a>s.  Be aware,
however, that this pattern currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.

On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.
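
The claims made earlier in this entry about C<< ^(?>a*)ab >> and the
C<(?=(pattern))\1> analogue can be checked with a short sketch.  The
test string, the loop, and the name C<%outcome> are illustrative
assumptions rather than part of the original discussion.

    use strict;
    use warnings;

    my $str = "aaab";
    my %outcome;            # hypothetical name, for illustration only

    # An ordinary group, an independent group, and the look-ahead analogue.
    for my $pat ( '^a*ab', '^(?>a*)ab', '^(?=(a*))\1ab' ) {
        $outcome{$pat} = $str =~ /$pat/ ? "matches" : "fails";
        printf "%-16s %s\n", $pat, $outcome{$pat};
    }

Only the first pattern should match.  In the other two, the leading
part takes every C<a> and cannot give one back, so nothing is left for
the trailing C<ab>.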

The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution.  Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace.  Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way.  The correct
answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into $1, one should use either
one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects
the above specification of comments.

=item C<(?(condition)yes-pattern|no-pattern)>

=item C<(?(condition)yes-pattern)>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

Conditional expression.  C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), or a look-ahead/look-behind/evaluate zero-width assertion.

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.

=back

=head2 Backtracking

NOTE: This section presents an abstract approximation of regular
expression behavior.  For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L<Combining pieces together>.

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>.  Backtracking is often optimized
internally, but the general principle outlined here is valid.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it.  So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking:  Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo".  However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match.  This time it goes all the way until the next occurrence
of "foo".  The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot.  Imagine you'd like to match
everything between "foo" and "bar".  Initially, you write something
like this:

    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

    got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar".
Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }

    got <d is under the >

Here's another example: let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string.  As C<\d*> can match on an empty string the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky.  It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success.  There may be 0, 1, or several different ways that the
definition might succeed against a particular string.  And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.
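
To see the "several different ways" concretely, one can make the engine
keep backtracking after every success and record what it found.  This
is only a sketch under stated assumptions: the empty negative look-ahead
C<(?!)> serves as an assertion that always fails, C<@ways> is a
hypothetical name, and the recording relies on the experimental
C<(?{ code })> feature described earlier.

    use strict;
    use warnings;

    our @ways;      # hypothetical name; a package variable for older perls

    # Each time the pattern proper succeeds, note the captures, then fail
    # on purpose so that the engine backtracks to the next possibility.
    "53147" =~ /^(.*)(\d+)\z(?{ push @ways, "<$1> <$2>" })(?!)/;

    print "$_\n" for @ways;

The first entry recorded corresponds to what an ordinary match with
C<(.*)(\d+)$> reports, because a normal match stops at the first
success the engine reaches.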

When using look-ahead assertions and negations, this can all get even
trickier.  Imagine you'd like to find a sequence of non-digits not
followed by "123".  You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {              # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping.  It
claims that there is no 123 in the string.  Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123' ;
    $y = 'ABC445' ;

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1.  The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not.  What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?"
If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC".  Then it will
try to match C<(?!123)> with "123", which fails.  But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time.  Now there's indeed something following "AB" that is not
"123".  It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
and by something that's not "123".  Remember that the look-aheads
are zero-width expressions--they only look, but don't consume any
of the string in their match.
So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions:  C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously.  The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar.  C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.

B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try to match.  For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<*>'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space.
Moreover, these internal optimizations are not
always applicable.  For example, if you put C<{0,5}> instead of C<*>
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L<C<< (?>pattern) >>>).  Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant.  For an example
where side-effects of look-ahead I<might> have influenced the
following match, see L<C<< (?>pattern) >>>.

=head2 Version 8 Regular Expressions

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above.  You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\").  A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.
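
A brief sketch of the escaping rule follows.  The sample patterns and
strings, and the name C<@results>, are illustrative assumptions.

    use strict;
    use warnings;

    # Each entry: pattern source, target string, what is being shown.
    my @checks = (
        [ 'a.c',   'abc', 'a bare dot matches any character'      ],
        [ 'a\.c',  'abc', 'an escaped dot does not match "b"'     ],
        [ 'a\.c',  'a.c', 'an escaped dot matches a literal dot'  ],
        [ '\\\\',  'a\b', 'an escaped backslash matches a "\"'    ],
    );

    my @results;    # hypothetical name, for illustration only
    for my $check (@checks) {
        my ($pat, $str, $note) = @$check;
        push @results, ( $str =~ /$pat/ ? "yes" : "no" );
        print "$pat against $str: $results[-1] ($note)\n";
    }

The patterns are written in single quotes so that the backslashes reach
the regular expression engine unchanged, apart from the doubled
backslashes in the last entry, which single quotes reduce to two.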

You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any one character from the list.  If the
first character after the "[" is "^", the class matches any character not
in the list.  Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive.  If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash.  "-" is also taken literally when it is
at the end of the list, just before the closing "]".  (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC
based coded character sets.)  Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, that's not a range, the "-" is understood literally.

Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect.  A sound principle is to use only ranges
that begin from and end at either alphabets of equal case ([a-e],
[A-E]), or digits ([0-9]).  Anything else is unsafe.  If in doubt,
spell out the character sets in full.
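
The treatment of "-" in the four classes mentioned above can be checked
with a small sketch.  The probe characters and the name C<%hits> are
illustrative assumptions.

    use strict;
    use warnings;

    my %hits;       # hypothetical name, for illustration only

    # Three spellings of the same three-character class, then a real range.
    for my $class ( '[-az]', '[az-]', '[a\-z]', '[a-z]' ) {
        $hits{$class} = join " ", grep { /$class/ } qw(a - m z);
        print "$class matches: $hits{$class}\n";
    }

The first three classes should accept "a", "-" and "z" but not "m",
while C<[a-z]> should accept "m" and reject "-".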

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose coded character set value
is I<nnn>.  Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose numeric value is I<nn>.  The expression \cI<x>
matches the character control-I<x>.  Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>).  The
first alternative includes everything from the last pattern delimiter
("(", "[", or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
pattern delimiter.  That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches, is the one that
is chosen.  This means that alternatives are not necessarily greedy.
For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string.  (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n>.  Subpatterns are numbered based on the left to right order
of their opening parenthesis.  A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern.  Therefore, C<(0|0x)\d*\s\1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

=head2 Warning on \1 vs $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the RHS of a substitute to avoid shocking the
B<sed> addicts, but it's a dirty habit to get into.
That's because in PerlThink, the righthand side of an C<s///> is a
double-quoted string. C<\1> in the usual double-quoted string means a
control-A. The customary Unix meaning of C<\1> is kludged in for C<s///>.
However, if you get into the habit of doing that, you get yourself into
trouble if you then add an C</e> modifier.

    s/(\d+)/ \1 + 1 /eg;        # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

=head2 Repeated patterns matching zero-length substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> can match at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> modifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy modifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring.
Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{   (?: NON_ZERO_LENGTH )*
       |
         (?: ZERO_LENGTH )?
     }x;

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the match
that follows a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.
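
As a rough illustration of the higher-level rules, the following sketch
runs the C<\w??> substitution from this section and a C<split //> on the
same string. It is only a sketch: it uses a capture group and C<$1> in
place of C<$&>, and the variable names C<$text>, C<$marked>, and C<@chars>
are chosen here for illustration and do not come from the examples above.

```perl
use strict;
use warnings;

# Illustrative names only; none of these variables appear in the text above.
my $text = 'bar';

# Non-greedy \w?? prefers the empty match at each position. After an
# empty match, /g refuses a second empty match at the same position,
# so the next match has to consume one character.
(my $marked = $text) =~ s/(\w??)/<$1>/g;
print "$marked\n";                # <><b><><a><><r><>

# An empty pattern can match at every position, yet split still
# terminates, returning one element per character.
my @chars = split //, $text;
print join(',', @chars), "\n";    # b,a,r
```

Both statements terminate even though their patterns can match the empty
string, because the second of two consecutive matches at the same position
is not allowed to be empty.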

=head2 Combining pieces together

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible.
This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

Ordering of two matches for C<S> is the same as for C<S>. Similar for
two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>

The ordering is the same as for the regular expression which is
the result of EXPR.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

=head2 Creating custom RE engines

Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # An escaped backslash must stay an escaped backslash, so '\\'
    # maps back to two backslashes rather than one.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

=head1 BUGS

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.