=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.


=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of line only at the left and right ends of the string to
matching them anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 256, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.

There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in a bracketed
character class unless the character that maps to them is explicitly
mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing. See
L<perlrecharclass/Bracketed Character Classes>, and
L<perlrecharclass/Negation>.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

=item g and c
X</g> X</c>

Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself. See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.

=item a, d, l and u
X</a> X</d> X</l> X</u>

These modifiers, all new in 5.14, affect which character-set semantics
(Unicode, etc.) are used, as described below in
L</Character set modifiers>.

=back

Regular expression modifiers are usually written in documentation
as e.g., "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.

=head3 /x

C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes. Taken together, these features go a long way towards
making Perl's regular expressions more readable. Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<(>,
C<?>, and C<:>.
Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>

=head3 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set semantics
used for the regular expression.

The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to you, and so you need not worry about them very much. They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances. But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.

The C</a> modifier, on the other hand, may be useful. Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.

Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
effect at the time of the execution of the pattern match.

C</u> sets the character set to B<U>nicode.

C</a> also sets the character set to Unicode, BUT adds several
restrictions for B<A>SCII-safe matching.

C</d> is the old, problematic, pre-5.14 B<D>efault character set
behavior. Its only use is to force that old behavior.

At any given time, exactly one of these modifiers is in effect. Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed.
And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and only it.

The C</l> and C</u> modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly. For one thing, the modifiers
affect only pattern matching, and do not extend even to any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes. For example,

    s/foo/\Ubar/il

will match "foo" using the locale's rules for case-insensitive matching,
but the C</l> does not affect how the C<\U> operates. Most likely you
want both of them to use locale rules. To do this, instead compile the
regular expression within the scope of C<use locale>. This both
implicitly adds the C</l> and applies locale rules to the C<\U>. The
lesson is to C<use locale> and not C</l> explicitly.

Similarly, it would be better to use C<use feature 'unicode_strings'>
instead of,

    s/foo/\Lbar/iu

to get Unicode rules, as the C<\L> in the former (but not necessarily
the latter) would also use Unicode rules.

More detail on each of the modifiers follows. Most likely you don't
need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
to L<E<sol>a|/E<sol>a (and E<sol>aa)>.

=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match.
This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

Perl only supports single-byte locales. This means that code points
above 255 are treated as Unicode no matter what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. These are disallowed under C</l>. For example,
0xFF (on ASCII platforms) does not caselessly match the character at
0x178, C<LATIN CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be
C<LATIN SMALL LETTER Y WITH DIAERESIS> in the current locale, and Perl
has no way of knowing if that character even exists in the locale, much
less what code point it is.

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
(Otherwise Perl considers their meanings to be undefined.) Thus,
under this modifier, the ASCII platform effectively becomes a Unicode
platform; and hence, for example, C<\w> will match any of the more than
100_000 word characters in Unicode.

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> in
the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches.
And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCD/num()> can be used to sort
this out. Or the C</a> modifier can be used to force C<\d> to match
just the ASCII 0 through 9.

Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, C<use locale ':not_characters'>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher),
but see L</Which character set modifier is in effect?>.
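
As a hedged illustration of the digit issue described above (a sketch
that is not part of the original text; the variable names are invented
for the example), the following shows C<\d> accepting non-ASCII digits
under C</u>, C</a> rejecting them, and C<num()> from L<Unicode::UCD>
distinguishing a single-script number from a mixed one:

    use strict;
    use warnings;
    use Unicode::UCD 'num';

    # BENGALI DIGIT FOUR followed by BENGALI DIGIT TWO
    my $bengali = "\N{U+09EA}\N{U+09E8}";
    # An ASCII digit followed by a Bengali digit
    my $mixed   = "4\N{U+09E8}";

    # 1: both characters are decimal digits under Unicode rules
    my $u_matches = $bengali =~ /^\d+$/u ? 1 : 0;
    # 0: /a restricts \d to the ASCII digits 0 through 9
    my $a_matches = $bengali =~ /^\d+$/a ? 1 : 0;

    # 42: all digits come from one script
    my $value       = num($bengali);
    # undef: the digits come from two different scripts
    my $mixed_value = num($mixed);

Rejecting any input for which C<num()> returns C<undef> is one way to
refuse mixed-script digit strings before they are used as numbers.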
X</u>

=head4 /d

This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string is encoded in UTF-8; or

=item 2

the pattern is encoded in UTF-8; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}>); or

=item 6

the pattern uses L</C<(?[ ])>>

=back

Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
become rather infamous, leading to yet another (printable) name for this
modifier, "Dodgy".

Unless the pattern or string is encoded in UTF-8, only ASCII characters
can match positively.

Here are some examples of how that works on an ASCII platform:

    $str = "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;      # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
    $str =~ /^\w/;      # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;      # Still a match! $str remains in UTF-8 format.

This modifier is automatically selected by default when none of the
others are, so yet another name for it is "Default".

Because of the unexpected behaviors associated with this modifier, you
probably should only use it to maintain weird backward compatibilities.

=head4 /a (and /aa)

This modifier stands for ASCII-restrict (or ASCII-safe). This modifier,
unlike the others, may be doubled-up to increase its effect.

When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
the Posix character classes to match only in the ASCII range. They thus
revert to their pre-5.6, pre-Unicode meanings. Under C</a>, C<\d>
always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
characters C<[ \f\n\r\t]>, and starting in Perl v5.18, experimentally,
the vertical tab; C<\w> means the 63 characters
C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
C<[[:print:]]> match only the appropriate ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode,
and who do not wish to be burdened with its complexities and security
concerns.

With C</a>, one can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>). There are
similar C<\p{...}> constructs that can match beyond ASCII both white
space (see L<perlrecharclass/Whitespace>), and Posix classes (see
L<perlrecharclass/POSIX Character Classes>). Thus, this modifier
doesn't mean you can't use Unicode, it means that to get Unicode
matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that
signals Unicode.

As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).

Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode semantics; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.
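
As a hedged illustration of the rules above (a sketch that is not part
of the original text; the variable names are invented for the example),
the following compares C</u> and C</a> on a Latin-1 letter and checks
that C</i> still folds C<KELVIN SIGN> to "k" under a single C</a>:

    use strict;
    use warnings;

    my $e_acute = "\N{U+E9}";     # LATIN SMALL LETTER E WITH ACUTE
    my $kelvin  = "\N{U+212A}";   # KELVIN SIGN

    # 1: a word character under Unicode rules
    my $word_u    = $e_acute =~ /^\w$/u ? 1 : 0;
    # 0: under /a, \w means only [A-Za-z0-9_]
    my $word_a    = $e_acute =~ /^\w$/a ? 1 : 0;
    # 1: under /a, every non-ASCII character matches \W
    my $nonword_a = $e_acute =~ /^\W$/a ? 1 : 0;
    # 1: case-insensitive matching still uses Unicode rules
    my $kelvin_ai = $kelvin  =~ /^k$/ai ? 1 : 0;

To match the letter as a word character while C</a> is in effect, the
explicit C<\p{Word}> form mentioned above can be used instead of C<\w>.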

To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the "a" twice, for example C</aai> or C</aia>. (The first
occurrence of "a" restricts the C<\d>, etc., and the second occurrence
adds the C</i> restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for C</i> matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.

To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.

This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>. If you do so, you may actually have occasion to use
the C</u> modifier explicitly if there are a few regular expressions
where you do want full Unicode rules (but even here, it's best if
everything were under feature C<"unicode_strings">, along with the
C<use re '/aa'>). Also see L</Which character set modifier is in
effect?>.
X</a>
X</aa>

=head4 Which character set modifier is in effect?

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. These have
been designed so that in general you don't have to worry about it, but
this section gives the gory details. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.

The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope.
This pragma has precedence over the other pragmas
listed below that also change the defaults.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>.
(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
sets the default to C</u>, overriding any plain C<use locale>.)
Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.

=head4 Character set modifier behavior prior to Perl 5.14

Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.)
See L</Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>


    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this option was removed in perl 5.10.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>

=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated as a regular
character. In particular, the lower quantifier bound is not optional,
and a typo in a quantifier silently causes it to be treated as the
literal characters. For example,

    /o{4,3}/

looks like a quantifier that matches 0 times, since 4 is greater than 3,
but it really means to match the sequence of six characters
S<C<"o { 4 , 3 }">>. It is planned to eventually require literal uses
of curly brackets to be escaped, say by preceding them with a backslash
or enclosing them within square brackets, (C<"\{"> or C<"[{]">). This
change will allow for future syntax extensions (like making the lower
bound of a quantifier optional), and better error checking. In the
meantime, you should get in the habit of escaping all instances where
you mean a literal "{".)

The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily

By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct.
For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/

=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    \E          end either case modification or quoted section, think vi

Details are in L<perlop/Quote and Quote-like Operators>.

=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

    Sequence   Note    Description
     [...]     [1]  Match a character according to the rules of the
                      bracketed character class defined by the "...".
                      Example: [a-z] matches "a" or "b" or "c" ... or "z"
     [[:...:]] [2]  Match a character according to the rules of the POSIX
                      character class "..." within the outer bracketed
                      character class. Example: [[:upper:]] matches any
                      uppercase character.
     (?[...])  [8]  Extended bracketed character class
     \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                      other connector punctuation chars plus Unicode
                      marks)
     \W        [3]  Match a non-"word" character
     \s        [3]  Match a whitespace character
     \S        [3]  Match a non-whitespace character
     \d        [3]  Match a decimal digit character
     \D        [3]  Match a non-digit character
     \pP       [3]  Match P, named property. Use \p{Prop} for longer names
     \PP       [3]  Match non-P
     \X        [4]  Match Unicode "eXtended grapheme cluster"
     \C             Match a single C-language char (octet) even if that is
                      part of a larger UTF-8 character. Thus it breaks up
                      characters into their UTF-8 bytes, so you may end up
                      with malformed pieces of UTF-8. Unsupported in
                      lookbehind.
     \1        [5]  Backreference to a specific capture group or buffer.
                      '1' may actually be any positive integer.
     \g1       [5]  Backreference to a specific or previous group,
     \g{-1}    [5]  The number may be negative indicating a relative
                      previous group and may optionally be wrapped in
                      curly brackets for safer parsing.
     \g{name}  [5]  Named backreference
     \k<name>  [5]  Named backreference
     \K        [6]  Keep the stuff left of the \K, don't include it in $&
     \N        [7]  Any character but \n. Not affected by /s modifier
     \v        [3]  Vertical whitespace
     \V        [3]  Not vertical whitespace
     \h        [3]  Horizontal whitespace
     \H        [3]  Not horizontal whitespace
     \R        [4]  Linebreak

=over 4

=item [1]

See L<perlrecharclass/Bracketed Character Classes> for details.

=item [2]

See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

See L<perlrecharclass/Backslash sequences> for details.

=item [4]

See L<perlrebackslash/Misc> for details.

=item [5]

See L</Capture groups> below for details.

=item [6]

See L</Extended Patterns> below for details.

=item [7]

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character or character sequence whose name is C<NAME>; and similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.

=item [8]

See L<perlrecharclass/Extended Bracketed Character Classes> for details.

=back

=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consequent substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match.
Thus the following
will not match forever:
X<\G>

    my $string = 'ABC';
    pos($string) = 1;
    while ($string =~ /(.\G)/g) {
        print $1;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.

=head3 Capture groups

The bracketing construct C<( ... )> creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
for the second, and so on.
This is called a I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regex, capture group> X<regexp, capture group>
X<regular expression, capture buffer> X<backreference>
X<regular expression, capture group> X<backreference>
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
X<named capture buffer> X<regular expression, named capture buffer>
X<named capture group> X<regular expression, named capture group>
X<%+> X<$+{name}> X<< \k<name> >>
There is no limit to the number of captured substrings that you may use.
Groups are numbered with the leftmost open parenthesis being number 1, etc. If
a group did not match, the associated backreference won't match either. (This
can happen if the group is optional, or in a different branch of an
alternation.)
You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
this form, described below.

You can also refer to capture groups relatively, by using a negative number, so
that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
group, and C<\g-2> and C<\g{-2}> both refer to the group before it.
For
example:

        /
         (Y)            # group 1
         (              # group 2
            (X)         # group 3
            \g{-1}      # backref to group 3
            \g{-3}      # backref to group 1
         )
        /x

would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.

You can dispense with numbers altogether and create named capture groups.
The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
reference. (To be compatible with .NET regular expressions, C<\g{I<name>}> may
also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
I<name> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group. Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require C<(??{})>.)

Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.

Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones. Braces are safer when creating a regex by
concatenating smaller strings. For example if you have C<qr/$a$b/>, and C<$a>
contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.

The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were no named nor relative numbered capture groups.
Absolute numbered
groups were referred to using C<\1>,
C<\2>, etc., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
only if at least 10 left parentheses have opened before it. Likewise C<\11> is
a backreference only if at least 11 left parentheses have opened before it.
And so on. C<\1> through C<\9> are always interpreted as backreferences.
There are several examples below that illustrate these perils. You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.

The C<\I<digit>> notation also works in certain circumstances outside
the pattern. See L</Warning on \1 Instead of $1> below for details.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/;  # True!
    "aa\x08" =~ /${b}0/;  # False

Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized.
So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.17.4, the presence of each of the three
variables in a program is recorded separately, and depending on
circumstances, perl may be able to be more efficient knowing that only C<$&>
rather than all three have been seen, for example.
X<$&> X<$`> X<$'>

As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation char equivalents, however at the trade-off that you
have to tell perl when you want to use them.
X</p> X<p modifier>

=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \[, \], \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>. The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.

The stability of these extensions varies widely. Some have been
part of the core language for many years. Others are experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology....

=over 4

=item C<(?#text)>
X<(?#)>

A comment. The text is ignored. If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice. Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.
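
As a minimal sketch of the two comment styles just described, the
following matches the same string once with inline C<(?#text)> comments
and once with C<#> comments under C</x>. The sample date and the variable
names are illustrative assumptions, not part of the original text.

```perl

    use strict;
    use warnings;

    my $date = '2024-05-17';    # hypothetical sample input

    # Inline comments: each (?#...) is ignored by the regex engine.
    my $inline = $date =~ /^(\d{4})(?#year)-(\d{2})(?#month)-(\d{2})(?#day)$/;

    # Under /x, a plain "#" comment runs to the end of the line.
    my $extended = $date =~ /
        ^ (\d{4})   # year
        - (\d{2})   # month
        - (\d{2})   # day
        $
    /x;

    print "inline:   ", ($inline   ? "match" : "no match"), "\n";
    print "extended: ", ($extended ? "match" : "no match"), "\n";

```

The C</x> form is usually easier to read for anything longer than a few
tokens; C<(?#text)> remains handy when C</x> cannot be used.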

=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).

This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere. Consider the case where some patterns want to be
case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.

These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>. See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.
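
The scoping rules above can be checked with a short sketch. It assumes
Perl 5.14 or later (for the caret form); the test strings and variable
names are made up for illustration.

```perl

    use strict;
    use warnings;

    # (?i) applies only inside group 1; the backreference \g1 is then
    # compared case-sensitively against the text the group captured.
    my $re = qr/ ( (?i) blah ) \s+ \g1 /x;

    my $same_case  = "Blah Blah" =~ $re;    # repetition has identical case
    my $other_case = "Blah blah" =~ $re;    # repetition differs in case

    # (?^:...) does not inherit the outer /i, so "foo" inside it is
    # matched case-sensitively while " bar" still honours /i.
    my $caret_upper = "FOO bar" =~ /(?^:foo) bar/i;
    my $caret_lower = "foo BAR" =~ /(?^:foo) bar/i;

    printf "%-12s %s\n", $_->[0], ($_->[1] ? "match" : "no match")
        for [ same_case   => $same_case ],
            [ other_case  => $other_case ],
            [ caret_upper => $caret_upper ],
            [ caret_lower => $caret_lower ];

```

Only C<same_case> and C<caret_lower> are expected to match.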

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the
construct. Thus, for
example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flags modifiers as with
C<(?adluimsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions. These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret. If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.

Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>: A fresh beginning since the usual use of a caret is
to match at the beginning.

=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.

Consider the following pattern. The numbers underneath show in
which group the captured content will be stored.


    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern.
If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

   /(?|  (?<a> x ) (?<b> y )
      |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

  "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
  say $+ {a};   # Prints '12'
  say $+ {b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.

=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. Use look-behind instead (see below).
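
The "foobar" pitfall described above can be seen in a few lines. This is
an illustrative sketch with made-up variable names; it uses the negative
look-behind form C<< (?<!pattern) >> for contrast.

```perl

    use strict;
    use warnings;

    my $text = "foobar";

    # Look-ahead at the position of "bar": the next text is "bar", not
    # "foo", so the assertion holds and the pattern matches.
    my $ahead = $text =~ /(?!foo)bar/;

    # Look-behind: "bar" is immediately preceded by "foo", so this fails.
    my $behind = $text =~ /(?<!foo)bar/;

    # Negative look-ahead used as intended: "foo" not followed by "bar".
    my @hits = "foobar foobaz" =~ /(foo(?!bar)\w*)/g;

    print "look-ahead:  ", ($ahead  ? "match" : "no match"), "\n";
    print "look-behind: ", ($behind ? "match" : "no match"), "\n";
    print "hits: @hits\n";

```

Here C<@hits> should contain only C<foobaz>.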

=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

  s/(foo)bar/$1/g;

can be rewritten as the much more efficient

  s/foo\Kbar//g;

=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{NAME}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.
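
A minimal sketch of named capture groups and the C<%+> hash follows.
The date string, group names, and the C<%parts> hash are hypothetical
choices for illustration; Perl 5.10 or later is assumed.

```perl

    use strict;
    use warnings;

    my $stamp = '2024-05-17';    # hypothetical sample input
    my %parts;
    my $first;

    if ($stamp =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
        %parts = %+;    # copy the named captures before the next match
        $first = $1;    # named groups are still numbered, so $1 is "year"
    }

    print "year=$parts{year} month=$parts{month} day=$parts{day}\n";
    print "\$1 is also the year: $first\n";

```

Copying C<%+> into an ordinary hash is a deliberate choice here: the
match variables are overwritten by the next successful match.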

If multiple distinct capture groups have the same name then the
$+{NAME} will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern

  /(x)(?<foo>y)(z)/

$+{foo} will be the same as $2, and $3 will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.
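
To illustrate, the sketch below finds a repeated word using the native
C<< \k<NAME> >> form and then the Python/PCRE-style spelling. The input
string and the group name C<word> are assumptions made for the example.

```perl

    use strict;
    use warnings;

    my $line = "this is is a test";    # hypothetical sample input

    # Native Perl spelling: (?<word>...) with \k<word>.
    my ($k_form) = $line =~ /\b(?<word>\w+)\s+\k<word>\b/;

    # Python/PCRE-compatible spelling: (?P<word>...) with (?P=word).
    my ($p_form) = $line =~ /\b(?P<word>\w+)\s+(?P=word)\b/;

    print "repeated word (\\k form):  $k_form\n";
    print "repeated word (?P= form): $p_form\n";

```

Both matches are expected to report C<is> as the repeated word.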

=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine. The
implementation of this feature was radically overhauled for the 5.18.0
release, and its behaviour in earlier versions of perl was much buggier,
especially in relation to parsing, lexical vars, scoping, recursion and
reentrancy.

This zero-width assertion executes any embedded Perl code. It always
succeeds, and its return value is set as C<$^R>.

In literal patterns, the code is parsed at the same time as the
surrounding code. While within the pattern, control is passed temporarily
back to the perl parser, until the logically-balancing closing brace is
encountered. This is similar to the way that an array index expression in
a literal string is handled, for example

    "abc$array[ 1 + f('[') + g()]def"

In particular, braces do not need to be balanced:

    s/abc(?{ f('{'); })/def/

Even in a pattern that is interpolated and compiled at run-time, literal
code blocks will be compiled once, at perl compile time; the following
prints "ABCD":

    print "D";
    my $qr = qr/(?{ BEGIN { print "A" } })/;
    my $foo = "foo";
    /$foo$qr(?{ BEGIN { print "B" } })/;
    BEGIN { print "C" }

In patterns where the text of the code is derived from run-time
information rather than appearing literally in a source code /pattern/,
the code is compiled at the same time that the pattern is compiled, and
for reasons of security, C<use re 'eval'> must be in scope. This is to
stop user-supplied patterns containing code snippets from being
executable.
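
The effect of C<use re 'eval'> can be sketched as follows. The pattern
text in C<$src> stands in for a string obtained at run time, and the
package variable C<$main::seen> is a hypothetical flag used only for this
illustration; the code is assumed to run in package C<main>.

```perl

    use strict;
    use warnings;

    our $seen = 0;

    # Pattern text assembled at run time, containing a code block.
    my $src = 'b(?{ $main::seen = 1 })';

    # Without "use re 'eval'" in scope, compiling the interpolated
    # pattern is refused and the match dies; eval {} traps that.
    my $refused = !eval { "abc" =~ /$src/; 1 };

    {
        use re 'eval';      # lexically scoped opt-in
        "abc" =~ /$src/;    # now the embedded code is allowed to run
    }

    print "refused without pragma: ", ($refused ? "yes" : "no"), "\n";
    print "code ran with pragma:   ", ($seen    ? "yes" : "no"), "\n";

```

Keeping the pragma inside the smallest possible block limits where
run-time pattern strings are allowed to execute code.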

In situations where you need to enable this with C<use re 'eval'>, you should
also have taint checking enabled. Better yet, use the carefully
constrained evaluation within a Safe compartment. See L<perlsec> for
details about both these mechanisms.

From the viewpoint of parsing, lexical variable scope and closures,

    /AAA(?{ BBB })CCC/

behaves approximately like

    /AAA/ && do { BBB } && /CCC/

Similarly,

    qr/AAA(?{ BBB })CCC/

behaves approximately like

    sub { /AAA/ && do { BBB } && /CCC/ }

In particular:

    { my $i = 1; $r = qr/(?{ print $i })/ }
    my $i = 2;
    /$r/; # prints "1"

Inside a C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to know what is
the current position of matching within this string.

The code block introduces a new scope from the perspective of lexical
variable declarations, but B<not> from the perspective of C<local> and
similar localizing behaviours. So later code blocks within the same
pattern will still see the values which were localized in earlier blocks.
These accumulated localizations are undone either at the end of a
successful match, or if the assertion is backtracked (compare
L<"Backtracking">). For example,

  $_ = 'a' x 8;
  m<
     (?{ $cnt = 0 })                   # Initialize $cnt.
     (
       a
       (?{
           local $cnt = $cnt + 1;      # Update $cnt,
                                       # backtracking-safe.
       })
     )*
     aaaa
     (?{ $res = $cnt })                # On success copy to
                                       # non-localized location.
   >x;

will initially increment C<$cnt> up to 8; then during backtracking, its
value will be unwound back to 4, which is the value assigned to C<$res>.
At the end of the regex execution, $cnt will be wound back to its initial
value of 0.

This assertion may be used as the condition in a

    (?(condition)yes-pattern|no-pattern)

switch. If I<not> used in this way, the result of evaluation of C<code>
is put into the special variable C<$^R>. This happens immediately, so
C<$^R> can be used from other C<(?{ code })> assertions inside the same
regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.

Note that the special variable C<$^N> is particularly useful with code
blocks to capture the results of submatches in variables without having to
keep track of the number of nested parentheses. For example:

  $_ = "The brown fox jumps over the lazy dog";
  /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
  print "color = $color, animal = $animal\n";


=item C<(??{ code })>
X<(??{})>
X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression. It behaves in I<exactly> the
same way as a C<(?{ code })> code block as described above, except that
its return value, rather than being assigned to C<$^R>, is treated as a
pattern, compiled if it's a string (or used as-is if it's a qr// object),
then matched as if it were inserted instead of this construct.

During the matching of this sub-pattern, it has its own set of
captures which are valid during the sub-match, but are discarded once
control returns to the main pattern.
For example, the following matches,
with the inner pattern capturing "B" and matching "BB", while the outer
pattern captures "A";

    my $inner = '(.)\1';
    "ABBA" =~ /^(.)(??{ $inner })\1/;
    print $1; # prints "A";

Note that this means that there is no way for the inner pattern to refer
to a capture group defined outside. (The code block itself can use C<$1>,
etc., to refer to the enclosing pattern's capture groups.) Thus, although

    ('a' x 100)=~/(??{'(.)' x 100})/

I<will> match, it will I<not> set $1 on exit.

The following pattern matches a parenthesized group:

  $re = qr{
             \(
             (?:
                (?> [^()]+ )  # Non-parens without backtracking
              |
                (??{ $re })   # Group with matching parens
             )*
             \)
          }x;

See also
L<C<(?I<PARNO>)>|/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)>
for a different, more efficient way to accomplish
the same task.

Executing a postponed regular expression 50 times without consuming any
input string will result in a fatal error. The maximum depth is compiled
into perl, so changing it requires a custom build.

=item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>

Similar to C<(??{ code })> except that it does not involve executing any
code or potentially compiling a returned pattern string; instead it treats
the part of the current pattern contained within a specified capture group
as an independent pattern that must match at the current position.
Capture groups contained by the pattern will have the value as determined
by the outermost recursion.

I<PARNO> is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If I<PARNO> is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following. Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.

The following pattern matches a function foo() which may contain
balanced parentheses as the argument.

  $re = qr{ (                   # paren group 1 (full function)
              foo
              (                 # paren group 2 (parens)
                \(
                  (             # paren group 3 (contents of parens)
                  (?:
                   (?> [^()]+ ) # Non-parens without backtracking
                  |
                   (?2)         # Recurse to start of paren group 2
                  )*
                  )
                \)
              )
            )
          }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a
fatal error. Recursing deeper than 50 times without consuming any input
string will also result in a fatal error. The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group, whereas in PCRE and Python the recursed-into group is
treated as atomic. Also, modifiers are resolved at compile time, so constructs
like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will
be processed.

=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern. Identical to C<(?I<PARNO>)> except that the
parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the
pattern.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.

=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

=item C<(?(condition)yes-pattern)>

Conditional expression. Matches C<yes-pattern> if C<condition> yields
a true value, matches C<no-pattern> otherwise. A missing pattern always
matches.

C<(condition)> should be one of: 1) an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
name in angle brackets or single quotes (which is valid if a group
with the given name matched); or 4) the special symbol (R) (true when
evaluated inside of recursion or eval). Additionally the R may be
followed by a number, (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
be true only when evaluated during recursion in the named group.

Here's a summary of the possible predicates:

=over 4

=item (1) (2) ...

Checks if the numbered capturing group has matched something.

=item (<NAME>) ('NAME')

Checks if a group with the given name has matched something.

=item (?=...) (?!...) (?<=...) (?<!...)

Checks whether the pattern matches (or does not match, for the '!'
variants).

=item (?{ CODE })

Treats the return value of the code block as the condition.

=item (R)

Checks if the expression has been evaluated inside of recursion.

=item (R1) (R2) ...

Checks if the expression has been evaluated while executing directly
inside of the n-th capture group. This check is the regex equivalent of

    if ((caller(0))[3] eq 'subname') { ... }

In other words, it does not check the full recursion stack.

=item (R&NAME)

Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&NAME)> to disambiguate). It does not check the full
stack, but only the name of the innermost active recursion.

=item (DEFINE)

In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
See below for details.

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.

A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern. This allows one to
define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.
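
As a rough sketch of such a rule set, the following snippet defines two
subpatterns in a C<(DEFINE)> block and calls them by name. The rule names
C<WORD> and C<NUMBER>, the capture names C<key> and C<value>, and the
helper C<parse_pair> are all invented for this illustration.

    use strict;
    use warnings;

    # Hypothetical grammar: WORD and NUMBER are illustrative rule names.
    my $pair = qr{
        \A (?<key> (?&WORD) ) \s* = \s* (?<value> (?&NUMBER) ) \z
        (?(DEFINE)
            (?<WORD>   [A-Za-z_] \w* )
            (?<NUMBER> -? \d+ (?: \. \d+ )? )
        )
    }x;

    # Returns (key, value) on success, or an empty list on failure.
    sub parse_pair {
        my ($text) = @_;
        return $text =~ $pair ? ($+{key}, $+{value}) : ();
    }

    my ($key, $value) = parse_pair("width = 42.5");
    print "$key => $value\n";

The rules inside the block are reached only through C<(?&WORD)> and
C<(?&NUMBER)>; the outer named groups are what make the matched text
available in C<%+>.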

It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimiser is not very clever about
handling them.

An example of how this might be used is as follows:

    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x

Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.

Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;

will output 2, not 1. This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.

=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L<"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.

For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<a> at the beginning of string, leaving no C<a> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.

C<< (?>pattern) >> does not disable backtracking altogether once it has
matched. It is still possible to backtrack past the construct, but not
into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".

An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\g{-1}>. The look-ahead matches the same substring as a
standalone C<pattern>, and the following C<\g{-1}> eats the matched
string; it therefore makes a zero-length assertion into an analogue of
C<< (?>...) >>. (The difference between these two constructs is that the
second one uses a capturing group, thus shifting ordinals of
backreferences in the rest of a regular expression.)

Consider this pattern:

    m{ \(
          (
            [^()]+           # x+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching parentheses
two levels deep or less. However, if there is no such group, it
will take virtually forever on a long string. That's because there
are so many different ways to split a long string into several
substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern. Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time. This
exponential performance will make it appear that your program has
hung.
However, a tiny change to this pattern

    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a quarter of
the time when used on a similar string with 1000000 C<a>s. Be aware,
however, that, when this construct is followed by a
quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.

On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.

The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into $1, one should use either
one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects
the above specification of comments.

In some literature this construct is called "atomic matching" or
"possessive matching".

Possessive quantifiers are equivalent to putting the item they are applied
to inside of one of these constructs.
The following equivalences apply:

    Quantifier Form     Bracketing Form
    ---------------     ---------------
    PAT*+               (?>PAT*)
    PAT++               (?>PAT+)
    PAT?+               (?>PAT?)
    PAT{min,max}+       (?>PAT{min,max})

=item C<(?[ ])>

See L<perlrecharclass/Extended Bracketed Character Classes>.

=back

=head2 Special Backtracking Control Verbs

B<WARNING:> These patterns are experimental and subject to change or
removal in a future version of Perl. Their usage in production code should
be noted to avoid problems during upgrades.

These special patterns are generally of the form C<(*VERB:ARG)>. Unless
otherwise stated the ARG argument is optional; in some cases, it is
forbidden.

Any pattern containing a special backtracking verb that allows an argument
has the special behaviour that when executed it sets the current package's
C<$REGERROR> and C<$REGMARK> variables. When doing so the following
rules apply:

On failure, the C<$REGERROR> variable will be set to the ARG value of the
verb pattern, if the verb was involved in the failure of the match. If the
ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
none. Also, the C<$REGMARK> variable will be set to FALSE.

On a successful match, the C<$REGERROR> variable will be set to FALSE, and
the C<$REGMARK> variable will be set to the name of the last
C<(*MARK:NAME)> pattern executed. See the explanation for the
C<(*MARK:NAME)> verb below for more details.

B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
and most other regex-related variables. They are not local to a scope, nor
readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
Use C<local> to localize changes to them to a specific scope if necessary.
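
A minimal sketch of that advice, assuming a helper named C<classify>
(invented for this illustration) that uses the C<(*MARK:NAME)> verb
described later in this section to report which branch matched. Because
the variables are ordinary package variables, the sketch declares them
with C<our> and localizes them for the duration of the call:

    use strict;
    use warnings;

    # The engine sets these package variables, so they must be
    # declared under "use strict".
    our ($REGMARK, $REGERROR);

    # Hypothetical helper: reports which branch matched via $REGMARK.
    sub classify {
        my ($text) = @_;
        local ($REGMARK, $REGERROR);    # keep changes inside this call
        return $text =~ /\A(?:\d+(*MARK:number)|[a-z]+(*MARK:word))\z/
            ? $REGMARK
            : "no match";
    }

    print classify($_), "\n" for "42", "abc", "4x";

With the inputs shown, this would be expected to print C<number>,
C<word>, and C<no match>, while leaving the caller's copies of
C<$REGMARK> and C<$REGERROR> as they were.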

If a pattern does not contain a special backtracking verb that allows an
argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.

=over 3

=item Verbs that take an argument

=over 4

=item C<(*PRUNE)> C<(*PRUNE:NAME)>
X<(*PRUNE)> X<(*PRUNE:NAME)>

This zero-width pattern prunes the backtracking tree at the current point
when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
A may backtrack as necessary to match. Once it is reached, matching
continues in B, which may also backtrack as necessary; however, should B
not match, then no further backtracking will take place, and the pattern
will fail outright at the current starting position.

The following example counts all the possible matching strings in a
pattern (without actually matching any of them).

    'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

which produces:

    aaab
    aaa
    aa
    a
    aab
    aa
    a
    ab
    a
    Count=9

If we add a C<(*PRUNE)> before the count like the following

    'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

we prevent backtracking and find the count of the longest matching string
at each matching starting point like so:

    aaab
    aab
    ab
    Count=3

Any number of C<(*PRUNE)> assertions may be used in a pattern.

See also C<< (?>pattern) >> and possessive quantifiers for other ways to
control backtracking. In some cases, the use of C<(*PRUNE)> can be
replaced with a C<< (?>pattern) >> with no functional difference; however,
C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
C<< (?>pattern) >> alone.
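
As a small illustration of one case where the two forms agree, the sketch
below compares an ordinary greedy quantifier with C<(*PRUNE)> and with
C<< (?>pattern) >> on the same string. The variable names are chosen only
for this example.

    use strict;
    use warnings;

    my $text = 'aaab';

    # a+ may give back an "a" so that the trailing "ab" can match.
    my $plain  = $text =~ /a+ab/         ? 1 : 0;

    # Once a+ has matched, neither form lets it give anything back.
    my $pruned = $text =~ /a+(*PRUNE)ab/ ? 1 : 0;
    my $atomic = $text =~ /(?>a+)ab/     ? 1 : 0;

    print "plain=$plain pruned=$pruned atomic=$atomic\n";

Only the plain form should succeed here: at every starting position
C<a+> takes all the remaining C<a> characters, leaving none for C<ab>.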

=item C<(*SKIP)> C<(*SKIP:NAME)>
X<(*SKIP)>

This zero-width pattern is similar to C<(*PRUNE)>, except that on
failure it also signifies that whatever text that was matched leading up
to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
of this pattern. This effectively means that the regex engine "skips" forward
to this position on failure and tries to match again (assuming that
there is sufficient room to match).

The name of the C<(*SKIP:NAME)> pattern has special significance. If a
C<(*MARK:NAME)> was encountered while matching, then it is that position
which is used as the "skip point". If no C<(*MARK)> of that name was
encountered, then the C<(*SKIP)> operator has no effect. When used
without a name the "skip point" is where the match point was when
executing the C<(*SKIP)> pattern.

Compare the following to the examples in C<(*PRUNE)>; note the string
is twice as long:

    'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

outputs

    aaab
    aaab
    Count=2

Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
executed, the next starting point will be where the cursor was when the
C<(*SKIP)> was executed.

=item C<(*MARK:NAME)> C<(*:NAME)>
X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>

This zero-width pattern can be used to mark the point reached in a string
when a certain part of the pattern has been successfully matched. This
mark may be given a name. A later C<(*SKIP)> pattern will then skip
forward to that point if backtracked into on failure. Any number of
C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.

In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
can be used to "label" a pattern branch, so that after matching, the
program can determine which branches of the pattern were involved in the
match.

When a match is successful, the C<$REGMARK> variable will be set to the
name of the most recently executed C<(*MARK:NAME)> that was involved
in the match.

This can be used to determine which branch of a pattern was matched
without using a separate capture group for each branch, which in turn
can result in a performance improvement, as perl cannot optimize
C</(?:(x)|(y)|(z))/> as efficiently as something like
C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.

When a match has failed, and unless another verb has been involved in
failing the match and has provided its own name to use, the C<$REGERROR>
variable will be set to the name of the most recently executed
C<(*MARK:NAME)>.

See L</(*SKIP)> for more details.

As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.

=item C<(*THEN)> C<(*THEN:NAME)>

This is similar to the "cut group" operator C<::> from Perl 6. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
innermost enclosing group (capturing or otherwise) that has alternations.
The two branches of a C<(?(condition)yes-pattern|no-pattern)> do not
count as an alternation, as far as C<(*THEN)> is concerned.

Its name comes from the observation that this operation combined with the
alternation operator (C<|>) can be used to create what is essentially a
pattern-based if/then/else block:

    ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )

Note that if this operator is used outside of an alternation, then
it acts exactly like the C<(*PRUNE)> operator.

    / A (*PRUNE) B /

is the same as

    / A (*THEN) B /

but

    / ( A (*THEN) B | C ) /

is not the same as

    / ( A (*PRUNE) B | C ) /

as after matching the A but failing on the B the C<(*THEN)> verb will
backtrack and try C; but the C<(*PRUNE)> verb will simply fail.

=back

=item Verbs without an argument

=over 4

=item C<(*COMMIT)>
X<(*COMMIT)>

This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright. No further attempts
to find a valid match by advancing the start pointer will occur again.
For example,

    'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

outputs

    aaab
    Count=1

In other words, once the C<(*COMMIT)> has been entered, and if the pattern
does not match, the regex engine will not try any further matching on the
rest of the string.

=item C<(*FAIL)> C<(*F)>
X<(*FAIL)> X<(*F)>

This pattern matches nothing and always fails. It can be used to force the
engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
fact, C<(?!)> gets optimised into C<(*FAIL)> internally.

It is probably useful only when combined with C<(?{})> or C<(??{})>.

=item C<(*ACCEPT)>
X<(*ACCEPT)>

B<WARNING:> This feature is highly experimental. It is not recommended
for production code.

This pattern matches nothing and causes the end of successful matching at
the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
whether there is actually more to match in the string.
When inside of a
nested pattern, such as recursion, or in a subpattern dynamically generated
via C<(??{})>, only the innermost pattern is ended immediately.

If the C<(*ACCEPT)> is inside of capturing groups then the groups are
marked as ended at the point at which the C<(*ACCEPT)> was encountered.
For instance:

    'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;

will match, and C<$1> will be C<AB>, C<$2> will be C<B>, and C<$3> will not
be set. If another branch in the inner parentheses was matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.

=back

=back

=head2 Backtracking
X<backtrack> X<backtracking>

NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L<Combining RE Pieces>.

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all non-possessive regular expression quantifiers, namely C<*>, C<*?>,
C<+>, C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking: Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot. Imagine you'd like to match
everything between "foo" and "bar". Initially, you write something
like this:

    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

    got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar". Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
    got <d is under the >

Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string.
As C<\d*> can match on an empty string the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky. It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even
trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {                # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string.
Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time. Now there's indeed something following "AB" that is not
"123". It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
and by something that's not "123". Remember that the look-aheads
are zero-width expressions--they only look, but don't consume any
of the string in their match.
So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions: C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously. The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar. C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.

B<WARNING>: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<*>'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space. Moreover, these internal optimizations are not
always applicable. For example, if you put C<{0,5}> instead of C<*>
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L</C<< (?>pattern) >>>).
Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant. For an example
where side-effects of look-ahead I<might> have influenced the
following match, see L</C<< (?>pattern) >>>.

=head2 Version 8 Regular Expressions
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\"). This escape mechanism is also required
for the character used as the pattern delimiter.

A series of characters matches that series of characters in the target
string, so the pattern C<blurfl> would match "blurfl" in the target
string.

You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
first character after the "[" is "^", the class matches any character not
in the list. Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive. If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash. "-" is also taken literally when it is
at the end of the list, just before the closing "]". (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>.
All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC-based
character sets.) Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, the "-" is understood literally.

Note also that the whole range idea is rather unportable between
character sets--and even within character sets it may cause results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either alphabetics of equal case ([a-e],
[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of three octal digits, matches the character whose coded character set value
is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose ordinal is I<nn>. The expression \cI<x>
matches the character control-I<x>. Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.
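
A short sketch of why the grouping matters once anchors are involved. The
helper names C<loose> and C<tight> and the sample words are invented for
this illustration.

    use strict;
    use warnings;

    # Without grouping, the alternatives are "^fee", "fie" and "foe$".
    sub loose { return $_[0] =~ /^fee|fie|foe$/     ? 1 : 0 }

    # With grouping, both anchors apply to every alternative.
    sub tight { return $_[0] =~ /^(?:fee|fie|foe)$/ ? 1 : 0 }

    for my $word (qw(fie feeble magnifier)) {
        printf "%-10s loose=%d tight=%d\n",
            $word, loose($word), tight($word);
    }

Here "feeble" satisfies C<^fee> and "magnifier" contains "fie", so both
should pass the ungrouped pattern, while only "fie" should pass the
grouped one.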

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

=head2 Warning on \1 Instead of $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid
shocking the
B<sed> addicts, but it's a dirty habit to get into. That's because in
PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
the usual double-quoted string means a control-A. The customary Unix
meaning of C<\1> is kludged in for C<s///>.
However, if you get into the habit
of doing that, you get yourself into trouble if you then add an C</e>
modifier.

    s/(\d+)/ \1 + 1 /eg;            # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

=head2 Repeated Patterns Matching a Zero-length Substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>.
The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;

For example, this program

    #!perl -l
    "aaaaab" =~ /
      (?:
         a                 # non-zero
         |                 # or
        (?{print "hello"}) # print hello whenever this
                           #    branch is tried
        (?=(b))            # zero-width assertion
      )*  # any number of times
     /x;
    print $&;
    print $1;

prints

    hello
    aaaaa
    b

Notice that "hello" is only printed once, as when Perl sees that the sixth
iteration of the outermost C<(?:)*> matches a zero-length string, it stops
the C<*>.

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.
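
The alternation between zero-length and one-character matches described
above can be observed directly. The following is only a minimal sketch: the
variable names C<$marked> and C<@lengths> are illustrative assumptions, and
the loop simply records the length of each successive match rather than
claiming anything about how the engine is implemented.

```perl
use strict;
use warnings;

# Same substitution as in the text, applied to a copy of 'bar'.
(my $marked = 'bar') =~ s/\w??/<$&>/g;
print "$marked\n";

# Walk the same pattern with a scalar-context //g loop and record
# the length of every match; the loop terminates on its own because
# two consecutive zero-length matches at one position are not allowed.
my $str = 'bar';
my @lengths;
while ($str =~ /\w??/g) {
    push @lengths, length($&);
}
print "match lengths: @lengths\n";
```

Printing the lengths rather than the matched text makes the zero-length
matches visible, which would otherwise show up only as empty strings.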

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.

=head2 Combining RE Pieces

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.
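
The question of choice raised above (C<a|ab> against C<"abc">) can be
checked with a short script. This is a sketch only: C<matched_text> is a
hypothetical helper written for this illustration, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the substring matched by $pattern
# in $string, or undef if the pattern does not match at all.
sub matched_text {
    my ($pattern, $string) = @_;
    return $string =~ $pattern ? $& : undef;
}

# The leftmost alternative that lets the whole pattern succeed wins.
my $first  = matched_text(qr/a|ab/, 'abc');
# Swapping the alternatives changes which substring is chosen.
my $second = matched_text(qr/ab|a/, 'abc');

print "a|ab matched '$first'\n";
print "ab|a matched '$second'\n";
```

The first call reports C<a> and the second C<ab>: the order in which the
alternatives are written, not their length, decides which match is "better".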

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

Ordering of two matches for C<S> is the same as for C<S>. Similarly for
two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>, C<(?I<PARNO>)>

The ordering is the same as for the regular expression which is
the result of EXPR, or the pattern contained by capture group I<PARNO>.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

=head2 Creating Custom RE Engines

As of Perl 5.10.0, one can create custom regular expression engines. This
is not for the faint of heart, as they have to plug in at the C level. See
L<perlreapi> for more details.

As an alternative, overloaded constants (see L<overload>) provide a simple
way to extend the functionality of the RE engine, by substituting one
pattern for another.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:

=over 4

=item C<< (?PE<lt>NAMEE<gt>pattern) >>

Define a named capture group. Equivalent to C<< (?<NAME>pattern) >>.

=item C<< (?P=NAME) >>

Backreference to a named capture group. Equivalent to C<< \g{NAME} >>.

=item C<< (?P>NAME) >>

Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.

=back

=head1 BUGS

Many regular expression constructs don't work on EBCDIC platforms.

There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<i> under L</Modifiers> above.

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.