=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a tutorial introduction
is available in L<perlretut>. If you know just a little about them,
a quick-start introduction is available in L<perlrequick>.

Except for the L</The Basics> section, this page assumes you are familiar
with regular expression basics, such as what a "pattern" is, what it
looks like, and how it is basically used. For a reference on how they
are used, plus various examples of the same, see discussions of C<m//>,
C<s///>, C<qr//> and C<"??"> in L<perlop/"Regexp Quote-Like Operators">.

New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter
rules than otherwise when compiling regular expression patterns. It can
find things that, while legal, may not be what you intended.

=head2 The Basics
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>

Regular expressions are strings with the very particular syntax and
meaning described in this document and auxiliary documents referred to
by this one. The strings are called "patterns". Patterns are used to
determine if some other string, called the "target", has (or doesn't
have) the characteristics specified by the pattern. We call this
"matching" the target string against the pattern. Usually the match is
done by having the target be the first operand, and the pattern be the
second operand, of one of the two binary operators C<=~> and C<!~>,
listed in L<perlop/Binding Operators>; and the pattern will have been
converted from an ordinary string by one of the operators in
L<perlop/"Regexp Quote-Like Operators">, like so:

    $foo =~ m/abc/

This evaluates to true if and only if the string in the variable C<$foo>
contains somewhere in it, the sequence of characters "a", "b", then "c".
(The C<=~ m>, or match operator, is described in
L<perlop/m/PATTERN/msixpodualngc>.)

Patterns that aren't already stored in some variable must be delimited,
at both ends, by delimiter characters. These are often, as in the
example above, forward slashes, and the typical way a pattern is written
in documentation is with those slashes. In most cases, the delimiter
is the same character, fore and aft, but there are a few cases where a
character looks like it has a mirror-image mate, where the opening
version is the beginning delimiter, and the closing one is the ending
delimiter, like

    $foo =~ m<abc>

Most times, the pattern is evaluated in double-quotish context, but it
is possible to choose delimiters to force single-quotish, like

    $foo =~ m'abc'

If the pattern contains its delimiter within it, that delimiter must be
escaped. Prefixing it with a backslash (I<e.g.>, C<"/foo\/bar/">)
serves this purpose.

Any single character in a pattern matches that same character in the
target string, unless the character is a I<metacharacter> with a special
meaning described in this document. A sequence of non-metacharacters
matches the same sequence in the target string, as we saw above with
C<m/abc/>.

Only a few characters (all of them being ASCII punctuation characters)
are metacharacters. The most commonly used one is a dot C<".">, which
normally matches almost any character (including a dot itself).

You can cause characters that normally function as metacharacters to be
interpreted literally by prefixing them with a C<"\">, just like the
pattern's delimiter must be escaped if it also occurs within the
pattern. Thus, C<"\."> matches just a literal dot, C<"."> instead of
its normal meaning. This means that the backslash is also a
metacharacter, so C<"\\"> matches a single C<"\">.
And a sequence that
contains an escaped metacharacter matches the same sequence (but without
the escape) in the target string. So, the pattern C</blur\\fl/> would
match any target string that contains the sequence C<"blur\fl">.

The metacharacter C<"|"> is used to match one thing or another. Thus

    $foo =~ m/this|that/

is TRUE if and only if C<$foo> contains either the sequence C<"this"> or
the sequence C<"that">. Like all metacharacters, prefixing the C<"|">
with a backslash makes it match the plain punctuation character; in its
case, the VERTICAL LINE.

    $foo =~ m/this\|that/

is TRUE if and only if C<$foo> contains the sequence C<"this|that">.

You aren't limited to just a single C<"|">.

    $foo =~ m/fee|fie|foe|fum/

is TRUE if and only if C<$foo> contains any of those 4 sequences from
the children's story "Jack and the Beanstalk".

As you can see, the C<"|"> binds less tightly than a sequence of
ordinary characters. We can override this by using the grouping
metacharacters, the parentheses C<"("> and C<")">.

    $foo =~ m/th(is|at) thing/

is TRUE if and only if C<$foo> contains either the sequence S<C<"this
thing">> or the sequence S<C<"that thing">>. The portions of the string
that match the portions of the pattern enclosed in parentheses are
normally made available separately for use later in the pattern,
substitution, or program. This is called "capturing", and it can get
complicated. See L</Capture groups>.

The first alternative includes everything from the last pattern
delimiter (C<"(">, C<"(?:"> (described later), I<etc>. or the beginning
of the pattern) up to the first C<"|">, and the last alternative
contains everything from the last C<"|"> to the next closing pattern
delimiter. That's why it's common practice to include alternatives in
parentheses: to minimize confusion about where they start and end.
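
As a rough illustration of how parentheses change where the alternatives
start and end, the following sketch compares an ungrouped and a grouped
pattern. The sample strings and variable names are made up for this
example only.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @targets = ('this thing', 'that thing', 'this', 'cat thing');

# Without parentheses the "|" splits the whole pattern:
# either "this" or "at thing" anywhere in the target.
my @loose = grep { /this|at thing/ } @targets;

# With parentheses only "is" and "at" are alternatives, and the
# group captures whichever one matched into $1.
my @captured;
for my $target (@targets) {
    push @captured, $1 if $target =~ /th(is|at) thing/;
}

print "loose:    ", join(', ', @loose), "\n";     # all four strings
print "captured: ", join(', ', @captured), "\n";  # is, at
```

The ungrouped pattern accepts C<"this"> and C<"cat thing">, which is
probably not what was meant; the grouped one accepts only the two
intended phrases.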

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches, is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against C<"barefoot">, only the C<"foo">
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Besides taking away the special meaning of a metacharacter, a prefixed
backslash changes some letter and digit characters away from matching
just themselves to instead have special meaning. These are called
"escape sequences", and all such are described in L<perlrebackslash>. A
backslash sequence (of a letter or digit) that doesn't currently have
special meaning to Perl will raise a warning if warnings are enabled,
as those are reserved for potential future use.

One such sequence is C<\b>, which matches a boundary of some sort.
C<\b{wb}> and a few others give specialized types of boundaries.
(They are all described in detail starting at
L<perlrebackslash/\b{}, \b, \B{}, \B>.) Note that these don't match
characters, but the zero-width spaces between characters. They are an
example of a L<zero-width assertion|/Assertions>. Consider again,

    $foo =~ m/fee|fie|foe|fum/

It evaluates to TRUE if, besides those 4 words, any of the sequences
"feed", "field", "Defoe", "fume", and many others are in C<$foo>. By
judicious use of C<\b> (or better (because it is designed to handle
natural language) C<\b{wb}>), we can make sure that only the Giant's
words are matched:

    $foo =~ m/\b(fee|fie|foe|fum)\b/
    $foo =~ m/\b{wb}(fee|fie|foe|fum)\b{wb}/

The final example shows that the characters C<"{"> and C<"}"> are
metacharacters.
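
The two points above, that the leftmost successful alternative wins and
that C<\b> keeps a word from matching inside a longer one, can be
sketched as follows. The sample strings and the hash names are
assumptions made for this illustration; only plain C<\b> is used so the
code does not depend on v5.22.

```perl
use strict;
use warnings;

# The first alternative that lets the whole pattern match is chosen,
# even when a later alternative would match more text.
my ($first)  = 'barefoot' =~ /(foo|foot)/;   # "foo"
my ($longer) = 'barefoot' =~ /(foot|foo)/;   # "foot"
print "first=$first longer=$longer\n";

# Hypothetical sample strings for the word-boundary comparison.
my @samples = ('fee fie foe fum', 'feed the field', 'Defoe', 'fum!');

my (%plain, %bounded);
for my $sample (@samples) {
    $plain{$sample}   = $sample =~ /fee|fie|foe|fum/       ? 1 : 0;
    $bounded{$sample} = $sample =~ /\b(fee|fie|foe|fum)\b/ ? 1 : 0;
    printf "%-16s plain=%d bounded=%d\n",
        $sample, $plain{$sample}, $bounded{$sample};
}
```

Every sample matches the plain pattern, but only the first and last
match once the boundaries are added.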

Another use for escape sequences is to specify characters that cannot
(or which you prefer not to) be written literally. These are described
in detail in L<perlrebackslash/Character Escapes>, but the next three
paragraphs briefly describe some of them.

Various control characters can be written in C language style: C<"\n">
matches a newline, C<"\t"> a tab, C<"\r"> a carriage return, C<"\f"> a
form feed, I<etc>.

More generally, C<\I<nnn>>, where I<nnn> is a string of three octal
digits, matches the character whose native code point is I<nnn>. You
can easily run into trouble if you don't have exactly three digits. So
always use three, or since Perl 5.14, you can use C<\o{...}> to specify
any number of octal digits.

Similarly, C<\xI<nn>>, where I<nn> are hexadecimal digits, matches the
character whose native ordinal is I<nn>. Again, not using exactly two
digits is a recipe for disaster, but you can use C<\x{...}> to specify
any number of hex digits.

Besides being a metacharacter, the C<"."> is an example of a "character
class", something that can match any single character of a given set of
them. In its case, the set is just about all possible characters. Perl
predefines several character classes besides the C<".">; there is a
separate reference page about just these, L<perlrecharclass>.

You can define your own custom character classes, by putting into your
pattern in the appropriate place(s), a list of all the characters you
want in the set. You do this by enclosing the list within C<[]> bracket
characters. These are called "bracketed character classes" when we are
being precise, but often the word "bracketed" is dropped. (Dropping it
usually doesn't cause confusion.) This means that the C<"["> character
is another metacharacter. It doesn't match anything just by itself; it
is used only to tell Perl that what follows it is a bracketed character
class.
If you want to match a literal left square bracket, you must
escape it, like C<"\[">. The matching C<"]"> is also a metacharacter;
again it doesn't match anything by itself, but just marks the end of
your custom class to Perl. It is an example of a "sometimes
metacharacter". It isn't a metacharacter if there is no corresponding
C<"[">, and matches its literal self:

    print "]" =~ /]/;  # prints 1

The list of characters within the character class gives the set of
characters matched by the class. C<"[abc]"> matches a single "a" or "b"
or "c". But if the first character after the C<"["> is C<"^">, the
class instead matches any character not in the list. Within a list, the
C<"-"> character specifies a range of characters, so that C<a-z>
represents all characters between "a" and "z", inclusive. If you want
either C<"-"> or C<"]"> itself to be a member of a class, put it at the
start of the list (possibly after a C<"^">), or escape it with a
backslash. C<"-"> is also taken literally when it is at the end of the
list, just before the closing C<"]">. (The following all specify the
same class of three characters: C<[-az]>, C<[az-]>, and C<[a\-z]>. All
are different from C<[a-z]>, which specifies a class containing
twenty-six characters, even on EBCDIC-based character sets.)

There is lots more to bracketed character classes; full details are in
L<perlrecharclass/Bracketed Character Classes>.

=head3 Metacharacters
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

L</The Basics> introduced some of the metacharacters. This section
gives them all. Most of them have the same meaning as in the I<egrep>
command.

Only the C<"\"> is always a metacharacter. The others are metacharacters
just sometimes. The following table lists all of them, summarizes
their use, and gives the contexts where they are metacharacters.
Outside those contexts or if prefixed by a C<"\">, they match their
corresponding punctuation character. In some cases, their meaning
varies depending on various pattern modifiers that alter the default
behaviors. See L</Modifiers>.


            PURPOSE                                  WHERE
 \   Escape the next character                    Always, except when
                                                  escaped by another \
 ^   Match the beginning of the string            Not in []
       (or line, if /m is used)
 ^   Complement the [] class                      At the beginning of []
 .   Match any single character except newline    Not in []
       (under /s, includes newline)
 $   Match the end of the string                  Not in [], but can
       (or before newline at the end of the       mean interpolate a
       string; or before any newline if /m is     scalar
       used)
 |   Alternation                                  Not in []
 ()  Grouping                                     Not in []
 [   Start Bracketed Character class              Not in []
 ]   End Bracketed Character class                Only in [], and
                                                    not first
 *   Matches the preceding element 0 or more      Not in []
       times
 +   Matches the preceding element 1 or more      Not in []
       times
 ?   Matches the preceding element 0 or 1         Not in []
       times
 {   Starts a sequence that gives number(s)       Not in []
       of times the preceding element can be
       matched
 {   when following certain escape sequences
       starts a modifier to the meaning of the
       sequence
 }   End sequence started by {
 -   Indicates a range                            Only in [] interior
 #   Beginning of comment, extends to line end    Only with /x modifier

Notice that most of the metacharacters lose their special meaning when
they occur in a bracketed character class, except C<"^"> has a different
meaning when it is at the beginning of such a class. And C<"-"> and C<"]">
are metacharacters only at restricted positions within bracketed
character classes; while C<"}"> is a metacharacter only when closing a
special construct started by C<"{">.
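
A few of the table's rows can be checked directly. The sketch below is
only illustrative; the two target strings and the variable names are
arbitrary choices, not anything Perl prescribes.

```perl
use strict;
use warnings;

# Outside a bracketed class, "." matches any character except newline.
my $dot_any      = 'a+b' =~ /^a.b$/       ? 1 : 0;   # 1: "." matches "+"

# Escaped with a backslash, it matches only a literal dot.
my $dot_escaped  = 'a+b' =~ /^a\.b$/      ? 1 : 0;   # 0

# Inside a bracketed class, ".", "+", "*", "?" and "|" are all literal.
my $in_class     = 'a+b' =~ /^a[.+*?|]b$/ ? 1 : 0;   # 1

# "^" as the first character of a class complements it.
my $negated      = 'a+b' =~ /^a[^+]b$/    ? 1 : 0;   # 0

# "-" at the start of a class is a literal hyphen, not a range.
my $literal_dash = 'a-b' =~ /^a[-+]b$/    ? 1 : 0;   # 1

print "$dot_any $dot_escaped $in_class $negated $literal_dash\n";
```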

In double-quotish context, as is usually the case, you need to be
careful about C<"$"> and the non-metacharacter C<"@">. Those could
interpolate variables, which may or may not be what you intended.

These rules were designed for compactness of expression, rather than
legibility and maintainability. The L</E<sol>x and E<sol>xx> pattern
modifiers allow you to insert white space to improve readability. And
use of S<C<L<re 'strict'|re/'strict' mode>>> adds extra checking to
catch some typos that might silently compile into something unintended.

By default, the C<"^"> character is guaranteed to match only the
beginning of the string, the C<"$"> character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by C<"^"> or C<"$">. You may, however, wish to treat a
string as a multi-line buffer, such that the C<"^"> will match after any
newline within the string (except if the newline is the last character in
the string), and C<"$"> will match before any newline. At the
cost of a little more overhead, you can do this by using the
C<L</E<sol>m>> modifier on the pattern match operator. (Older programs
did this by setting C<$*>, but this option was removed in perl 5.10.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the C<"."> character never matches a
newline unless you use the L<C<E<sol>s>|/s> modifier, which in effect tells
Perl to pretend the string is a single line--even if it isn't.
X<.> X</s>

=head2 Modifiers

=head3 Overview

The default behavior for matching can be changed, using various
modifiers. Modifiers that relate to the interpretation of the pattern
are listed just below.
Modifiers that alter the way a pattern is used
by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">. Modifiers can be added
dynamically; see L</Extended Patterns> below.

=over 4

=item B<C<m>>
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat the string being matched against as multiple lines. That is,
change C<"^"> and C<"$"> from matching
the start of the string's first line and the end of its last line to
matching the start and end of each line within the string.

=item B<C<s>>
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat the string as single line. That is, change C<"."> to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the C<"."> match any character whatsoever,
while still allowing C<"^"> and C<"$"> to match, respectively, just after
and just before newlines within the string.

=item B<C<i>>
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching. For example, "A" will match "a"
under C</i>.

If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 256, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed, unless
the locale is a UTF-8 one. See L<perllocale>.

There are a number of Unicode characters that match a sequence of
multiple characters under C</i>. For example,
C<LATIN SMALL LIGATURE FI> should match the sequence C<fi>.
Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in a bracketed
character class unless the character that maps to them is explicitly
mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing. See
L<perlrecharclass/Bracketed Character Classes>, and
L<perlrecharclass/Negation>.

=item B<C<x>> and B<C<xx>>
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</E<sol>x and E<sol>xx>

=item B<C<p>>
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that C<${^PREMATCH}>, C<${^MATCH}>, and
C<${^POSTMATCH}> are available for use after matching.

In Perl 5.20 and higher this is ignored. Due to a new copy-on-write
mechanism, C<${^PREMATCH}>, C<${^MATCH}>, and C<${^POSTMATCH}> will be available
after the match regardless of the modifier.

=item B<C<a>>, B<C<d>>, B<C<l>>, and B<C<u>>
X</a> X</d> X</l> X</u>

These modifiers, all new in 5.14, affect which character-set rules
(Unicode, I<etc>.) are used, as described below in
L</Character set modifiers>.

=item B<C<n>>
X</n> X<regex, non-capture> X<regexp, non-capture>
X<regular expression, non-capture>

Prevent the grouping metacharacters C<()> from capturing. This modifier,
new in 5.22, will stop C<$1>, C<$2>, I<etc>... from being filled in.

    "hello" =~ /(hi|hello)/;   # $1 is "hello"
    "hello" =~ /(hi|hello)/n;  # $1 is undef

This is equivalent to putting C<?:> at the beginning of every capturing group:

    "hello" =~ /(?:hi|hello)/; # $1 is undef

C</n> can be negated on a per-group basis. Alternatively, named captures
may still be used.

    "hello" =~ /(?-n:(hi|hello))/n;   # $1 is "hello"
    "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
                                      # "hello"

=item Other Modifiers

There are a number of flags that can be found at the end of regular
expression constructs that are I<not> generic regular expression flags, but
apply to the operation being performed, like matching or substitution (C<m//>
or C<s///> respectively).

Flags described further in
L<perlretut/"Using regular expressions in Perl"> are:

    c  - keep the current position during repeated matching
    g  - globally match the pattern repeatedly in the string

Substitution-specific modifiers described in
L<perlop/"s/PATTERN/REPLACEMENT/msixpodualngcer"> are:

    e  - evaluate the right-hand side as an expression
    ee - evaluate the right side as a string then eval the result
    o  - pretend to optimize your code, but actually introduce bugs
    r  - perform non-destructive substitution and return the new value

=back

Regular expression modifiers are usually written in documentation
as I<e.g.>, "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imnsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.

=head3 Details on some modifiers

Some of the modifiers require more explanation than given in the
L</Overview> above.
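
Before going into those details, here is a small sketch of the C</m>,
C</s>, and C</i> modifiers from the Overview, plus the embedded C<(?i)>
form. The two-line C<$text> string and the variable names are assumptions
made purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-line buffer.
my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @starts_default = $text =~ /^(\w+)/g;     # ("first")

# With /m, "^" also matches just after each embedded newline.
my @starts_m       = $text =~ /^(\w+)/mg;    # ("first", "second")

# Without /s, "." will not cross a newline; with /s it will.
my $dot_default = $text =~ /first.*second/  ? 1 : 0;   # 0
my $dot_s       = $text =~ /first.*second/s ? 1 : 0;   # 1

# /i ignores case; (?i) embeds the same modifier in the pattern.
my $ci_flag   = 'HELLO' =~ /hello/i    ? 1 : 0;         # 1
my $ci_inline = 'HELLO' =~ /(?i)hello/ ? 1 : 0;         # 1

print "default: @starts_default\n";
print "with /m: @starts_m\n";
print "dot: $dot_default $dot_s  case: $ci_flag $ci_inline\n";
```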

=head4 C</x> and C</xx>

A single C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a bracketed character class, nor within the characters
of a multi-character metapattern like C<(?i: ... )>. You can use this to
break up your regular expression into more readable parts.
Also, the C<"#"> character is treated as a metacharacter introducing a
comment that runs up to the pattern's closing delimiter, or to the end
of the current line if the pattern extends onto the next line. Hence,
this is very much like an ordinary Perl code comment. (You can include
the closing delimiter within the comment only if you precede it with a
backslash, so be careful!)

Use of C</x> means that if you want real
whitespace or C<"#"> characters in the pattern (outside a bracketed character
class, which is unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> or C<\p{name=...}> escapes.
It is ineffective to try to continue a comment onto the next line by
escaping the C<\n> with a backslash or C<\Q>.

You can use L</(?#text)> to create a comment that ends earlier than the
end of the current line, but C<text> also can't contain the closing
delimiter unless escaped with a backslash.

A common pitfall is to forget that C<"#"> characters (outside a
bracketed character class) begin a comment under C</x> and are not
matched literally. Just keep that in mind when trying to puzzle out why
a particular C</x> pattern isn't working as expected.
Inside a bracketed character class, C<"#"> retains its non-special,
literal meaning.

Starting in Perl v5.26, if the modifier has a second C<"x"> within it,
the effect of a single C</x> is increased.
The only difference is that
inside bracketed character classes, non-escaped (by a backslash) SPACE
and TAB characters are not added to the class, and hence can be inserted
to make the classes more readable:

    / [d-e g-i 3-7]/xx
    /[ ! @ " # $ % ^ & * () = ? <> ' ]/xx

may be easier to grasp than the squashed equivalents

    /[d-eg-i3-7]/
    /[!@"#$%^&*()=?<>']/

Note that this unfortunately doesn't mean that your bracketed classes
can contain comments or extend over multiple lines. A C<#> inside a
character class is still just a literal C<#>, and doesn't introduce a
comment. And, unless the closing bracket is on the same line as the
opening one, the newline character (and everything on the next line(s)
until terminated by a C<]>) will be part of the class, just as if you'd
written C<\n>.

Taken together, these features go a long way towards
making Perl's regular expressions more readable. Here's an example:

    # Delete (most) C comments.
    $program =~ s {
        /\*     # Match the opening delimiter.
        .*?     # Match a minimal number of characters.
        \*/     # Match the closing delimiter.
    } []gsx;

Note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example C<(?:...)> can't have a space between the C<"(">,
C<"?">, and C<":">. Within any delimiters for such a construct, allowed
spaces are not affected by C</x>, and depend on the construct. For
example, all constructs using curly braces as delimiters, such as
C<\x{...}> can have blanks within but adjacent to the braces, but not
elsewhere, and no non-blank space characters. An exception are Unicode
properties which follow Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>

The set of characters that are deemed whitespace are those that Unicode
calls "Pattern White Space", namely:

    U+0009 CHARACTER TABULATION
    U+000A LINE FEED
    U+000B LINE TABULATION
    U+000C FORM FEED
    U+000D CARRIAGE RETURN
    U+0020 SPACE
    U+0085 NEXT LINE
    U+200E LEFT-TO-RIGHT MARK
    U+200F RIGHT-TO-LEFT MARK
    U+2028 LINE SEPARATOR
    U+2029 PARAGRAPH SEPARATOR

=head4 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set rules
used for the regular expression.

The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to you, and so you need not worry about them very much. They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances. But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.

The C</a> modifier, on the other hand, may be useful. Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.

Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
effect at the time of the execution of the pattern match.

C</u> sets the character set to B<U>nicode.

C</a> also sets the character set to Unicode, BUT adds several
restrictions for B<A>SCII-safe matching.

C</d> is the old, problematic, pre-5.14 B<D>efault character set
behavior. Its only use is to force that old behavior.

At any given time, exactly one of these modifiers is in effect. Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed.
And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and don't affect the other
parts.

The C</l> and C</u> modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly. For one thing, the modifiers
affect only pattern matching, and do not extend to even any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes. For example,

    s/foo/\Ubar/il

will match "foo" using the locale's rules for case-insensitive matching,
but the C</l> does not affect how the C<\U> operates. Most likely you
want both of them to use locale rules. To do this, instead compile the
regular expression within the scope of C<use locale>. This both
implicitly adds the C</l>, and applies locale rules to the C<\U>. The
lesson is to C<use locale>, and not C</l> explicitly.

Similarly, it would be better to use C<use feature 'unicode_strings'>
instead of,

    s/foo/\Lbar/iu

to get Unicode rules, as the C<\L> in the former (but not necessarily
the latter) would also use Unicode rules.

More detail on each of the modifiers follows. Most likely you don't
need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
to L<E<sol>a|/E<sol>a (and E<sol>aa)>.

=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match.
This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

Prior to v5.20, Perl did not support multi-byte locales. Starting then,
UTF-8 locales are supported. No other multi-byte locales are ever
likely to be supported. However, in all locales, one can have code
points above 255 and these will always be treated as Unicode no matter
what locale is in effect.

Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. Except for UTF-8 locales in Perls v5.20 and
later, these are disallowed under C</l>. For example, 0xFF (on ASCII
platforms) does not caselessly match the character at 0x178, C<LATIN
CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL
LETTER Y WITH DIAERESIS> in the current locale, and Perl has no way of
knowing if that character even exists in the locale, much less what code
point it is.

In a UTF-8 locale in v5.20 and later, the only visible difference
between locale and non-locale in regular expressions should be tainting,
if your perl supports taint checking (see L<perlsec>).

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
(Otherwise Perl considers their meanings to be undefined.) Thus,
under this modifier, the ASCII platform effectively becomes a Unicode
platform; and hence, for example, C<\w> will match any of the more than
100_000 word characters in Unicode.
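
As a minimal sketch of the Latin-1 point, assuming an ASCII platform and
Perl 5.14 or later, the code point 0xE9 counts as a word character under
C</u> but not under C</a>. The variable names are illustrative only.

```perl
use strict;
use warnings;

# On ASCII platforms, code point 0xE9 is LATIN SMALL LETTER E WITH
# ACUTE in both Latin-1 and Unicode.
my $e_acute = "\xE9";

my $word_u = $e_acute =~ /\w/u ? 1 : 0;   # 1: Unicode rules
my $word_a = $e_acute =~ /\w/a ? 1 : 0;   # 0: \w restricted to ASCII

print "under /u: $word_u, under /a: $word_a\n";
```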

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> in
the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038), and C<LEPCHA DIGIT SIX> (U+1C46) looks
very much like an C<ASCII DIGIT FIVE> (U+0035). And C<\d+> may match
strings of digits that are a mixture from different writing systems,
creating a security issue. A fraudulent website, for example, could
display the price of something using U+1C46, and it would appear to the
user that something cost 500 units, but it really costs 600. A browser
that enforced script runs (L</Script Runs>) would prevent that
fraudulent display. L<Unicode::UCD/num()> can also be used to sort this
out. Or the C</a> modifier can be used to force C<\d> to match just the
ASCII 0 through 9.

Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters. The C<KELVIN SIGN>, for example matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<https://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.
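
The digit and case-folding behavior described above can be observed
with a short sketch. It assumes Perl 5.14 or later for the C</a>
modifier; the variable names and the mixed-script sample string are
made up for this example.

```perl
use strict;
use warnings;

my $bengali_four = "\x{09EA}";      # BENGALI DIGIT FOUR
my $mixed        = "5\x{1C46}0";    # ASCII 5, LEPCHA DIGIT SIX, ASCII 0

# These strings contain code points above 255, so Unicode rules apply
# and \d accepts decimal digits from any script.
my $digit_unicode = $bengali_four =~ /^\d$/  ? 1 : 0;   # 1
my $mixed_unicode = $mixed        =~ /^\d+$/ ? 1 : 0;   # 1

# Under /a, \d is limited to the ASCII digits 0 through 9.
my $digit_ascii = $bengali_four =~ /^\d$/a  ? 1 : 0;    # 0
my $mixed_ascii = $mixed        =~ /^\d+$/a ? 1 : 0;    # 0

# KELVIN SIGN (U+212A) matches "k" case-insensitively.
my $kelvin = "\x{212A}" =~ /^k$/i ? 1 : 0;              # 1

print "$digit_unicode $mixed_unicode $digit_ascii $mixed_ascii $kelvin\n";
```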

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, C<use locale ':not_characters'>, or
C<L<use v5.12|perlfunc/use VERSION>> (or higher),
but see L</Which character set modifier is in effect?>.
X</u>

=head4 /d

B<IMPORTANT:> Because of the unpredictable behaviors this
modifier causes, only use it to maintain weird backward compatibilities.
Use the
L<< C<unicode_strings>|feature/"The 'unicode_strings' feature" >>
feature
in new code to avoid inadvertently enabling this modifier by default.

What does this modifier do?  It "Depends"!

This modifier means to use platform-native matching rules
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string's L<UTF8 flag|perlunifaq/What is "the UTF8 flag"?>
(see below) is set; or

=item 2

the pattern's L<UTF8 flag|perlunifaq/What is "the UTF8 flag"?>
(see below) is set; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}> or C<\P{...}>); or

=item 6

the pattern uses a Unicode break (C<\b{...}> or C<\B{...}>); or

=item 7

the pattern uses C<L</(?[ ])>>; or

=item 8

the pattern uses L<C<(*script_run: ...)>|/Script Runs>

=back

Regarding the "UTF8 flag" references above: normally Perl applications
shouldn't think about that flag.  It's part of Perl's internals,
so it can change whenever Perl wants.  C</d> may thus cause unpredictable
results.  See L<perlunicode/The "Unicode Bug">.  This bug
has become rather infamous, earning this modifier yet other (printable)
names like "Dicey" and "Dodgy".

Here are some examples of how that works on an ASCII platform:

 $str =  "\xDF";        # LATIN SMALL LETTER SHARP S
 utf8::downgrade($str); # $str is not UTF8-flagged.
 $str =~ /^\w/;         # No match, since no UTF8 flag.

 $str .= "\x{0e0b}";    # Now $str is UTF8-flagged.
 $str =~ /^\w/;         # Match! $str is now UTF8-flagged.
 chop $str;
 $str =~ /^\w/;         # Still a match! $str retains its UTF8 flag.

Under Perl's default configuration this modifier is automatically
selected by default when none of the others are, so yet another name
for it (unfortunately) is "Default".

Whenever you can, use the
L<< C<unicode_strings>|feature/"The 'unicode_strings' feature" >>
feature to cause C</u> to be the default instead.

=head4 /a (and /aa)

This modifier stands for ASCII-restrict (or ASCII-safe).  This modifier
may be doubled-up to increase its effect.

When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
the Posix character classes to match only in the ASCII range.  They thus
revert to their pre-5.6, pre-Unicode meanings.  Under C</a>,  C<\d>
always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
characters C<[ \f\n\r\t]>, and starting in Perl v5.18, the vertical tab;
C<\w> means the 63 characters
C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
C<[[:print:]]> match only the appropriate ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode,
and who do not wish to be burdened with its complexities and security
concerns.

With C</a>, you can write C<\d> with confidence that it will only match
ASCII characters; should the need arise to match beyond ASCII, you
can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>).  There are
similar C<\p{...}> constructs that can match beyond ASCII both white
space (see L<perlrecharclass/Whitespace>), and Posix classes (see
L<perlrecharclass/POSIX Character Classes>).
Thus, this modifier 794doesn't mean you can't use Unicode, it means that to get Unicode 795matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that 796signals Unicode. 797 798As you would expect, this modifier causes, for example, C<\D> to mean 799the same thing as C<[^0-9]>; in fact, all non-ASCII characters match 800C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary 801between C<\w> and C<\W>, using the C</a> definitions of them (similarly 802for C<\B>). 803 804Otherwise, C</a> behaves like the C</u> modifier, in that 805case-insensitive matching uses Unicode rules; for example, "k" will 806match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code 807points in the Latin1 range, above ASCII will have Unicode rules when it 808comes to case-insensitive matching. 809 810To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>), 811specify the C<"a"> twice, for example C</aai> or C</aia>. (The first 812occurrence of C<"a"> restricts the C<\d>, I<etc>., and the second occurrence 813adds the C</i> restrictions.) But, note that code points outside the 814ASCII range will use Unicode rules for C</i> matching, so the modifier 815doesn't really restrict things to just ASCII; it just forbids the 816intermixing of ASCII and non-ASCII. 817 818To summarize, this modifier provides protection for applications that 819don't wish to be exposed to all of Unicode. Specifying it twice 820gives added protection. 821 822This modifier may be specified to be the default by C<use re '/a'> 823or C<use re '/aa'>. If you do so, you may actually have occasion to use 824the C</u> modifier explicitly if there are a few regular expressions 825where you do want full Unicode rules (but even here, it's best if 826everything were under feature C<"unicode_strings">, along with the 827C<use re '/aa'>). Also see L</Which character set modifier is in 828effect?>. 829X</a> 830X</aa> 831 832=head4 Which character set modifier is in effect? 

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions.  These have
been designed so that in general you don't have to worry about it, but
this section gives the gory details.  As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.

The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope.  This pragma has precedence over the other pragmas
listed below that also change the defaults.  Note that the C</x> modifier
does NOT affect C<split STR> patterns.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>>, or
C<L<use v5.12|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>.
(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
sets the default to C</u>, overriding any plain C<use locale>.)
Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, I<etc>. in substitution replacements.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default.  As this can lead to
unexpected results, it is best to specify which other rule set should be
used.
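
The precedence rules in this section can be sketched with the C<KELVIN
SIGN> example used earlier. The sketch below assumes a perl of v5.14 or
later; the variable names are illustrative only. It shows a default set by
C<use re '/aa'> being overridden by a modifier given explicitly on one
pattern.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";    # KELVIN SIGN

# No pragma in scope: /d is the default, but the target is UTF8-flagged,
# so Unicode rules apply and "k" caselessly matches the KELVIN SIGN.
my $default_hit = $kelvin =~ /^k$/i ? 1 : 0;

my ($aa_hit, $explicit_u_hit);
{
    use re '/aa';           # default for patterns compiled in this block

    # /aa forbids ASCII/non-ASCII caseless matches.
    $aa_hit = $kelvin =~ /^k$/i ? 1 : 0;

    # A modifier on the pattern itself has priority over the pragma.
    $explicit_u_hit = $kelvin =~ /^k$/iu ? 1 : 0;
}

print "default: $default_hit, /aa: $aa_hit, explicit /u: $explicit_u_hit\n";
```

Keeping the pragma inside a bare block limits its effect to the patterns
compiled there, which is usually what you want when only part of a program
needs the stricter rules.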
866 867=head4 Character set modifier behavior prior to Perl 5.14 868 869Prior to 5.14, there were no explicit modifiers, but C</l> was implied 870for regexes compiled within the scope of C<use locale>, and C</d> was 871implied otherwise. However, interpolating a regex into a larger regex 872would ignore the original compilation in favor of whatever was in effect 873at the time of the second compilation. There were a number of 874inconsistencies (bugs) with the C</d> modifier, where Unicode rules 875would be used when inappropriate, and vice versa. C<\p{}> did not imply 876Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12. 877 878=head2 Regular Expressions 879 880=head3 Quantifiers 881 882Quantifiers are used when a particular portion of a pattern needs to 883match a certain number (or numbers) of times. If there isn't a 884quantifier the number of times to match is exactly one. The following 885standard quantifiers are recognized: 886X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}> 887 888 * Match 0 or more times 889 + Match 1 or more times 890 ? Match 1 or 0 times 891 {n} Match exactly n times 892 {n,} Match at least n times 893 {,n} Match at most n times 894 {n,m} Match at least n but not more than m times 895 896(If a non-escaped curly bracket occurs in a context other than one of 897the quantifiers listed above, where it does not form part of a 898backslashed sequence like C<\x{...}>, it is either a fatal syntax error, 899or treated as a regular character, generally with a deprecation warning 900raised. To escape it, you can precede it with a backslash (C<"\{">) or 901enclose it within square brackets (C<"[{]">). 902This change will allow for future syntax extensions (like making the 903lower bound of a quantifier optional), and better error checking of 904quantifiers). 905 906The C<"*"> quantifier is equivalent to C<{0,}>, the C<"+"> 907quantifier to C<{1,}>, and the C<"?"> quantifier to C<{0,1}>. 
I<n> and I<m> are limited 908to non-negative integral values less than a preset limit defined when perl is built. 909This is usually 65534 on the most common platforms. The actual limit can 910be seen in the error message generated by code such as this: 911 912 $_ **= $_ , / {$_} / for 2 .. 42; 913 914By default, a quantified subpattern is "greedy", that is, it will match as 915many times as possible (given a particular starting location) while still 916allowing the rest of the pattern to match. If you want it to match the 917minimum number of times possible, follow the quantifier with a C<"?">. Note 918that the meanings don't change, just the "greediness": 919X<metacharacter> X<greedy> X<greediness> 920X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{,n}?> X<{n,m}?> 921 922 *? Match 0 or more times, not greedily 923 +? Match 1 or more times, not greedily 924 ?? Match 0 or 1 time, not greedily 925 {n}? Match exactly n times, not greedily (redundant) 926 {n,}? Match at least n times, not greedily 927 {,n}? Match at most n times, not greedily 928 {n,m}? Match at least n but not more than m times, not greedily 929 930Normally when a quantified subpattern does not allow the rest of the 931overall pattern to match, Perl will backtrack. However, this behaviour is 932sometimes undesirable. Thus Perl provides the "possessive" quantifier form 933as well. 934 935 *+ Match 0 or more times and give nothing back 936 ++ Match 1 or more times and give nothing back 937 ?+ Match 0 or 1 time and give nothing back 938 {n}+ Match exactly n times and give nothing back (redundant) 939 {n,}+ Match at least n times and give nothing back 940 {,n}+ Match at most n times and give nothing back 941 {n,m}+ Match at least n but not more than m times and give nothing back 942 943For instance, 944 945 'aaaa' =~ /a++a/ 946 947will never match, as the C<a++> will gobble up all the C<"a">'s in the 948string and won't leave any for the remaining part of the pattern. 
This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack.  For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

   /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help.  See the independent subexpression
C<L</(?E<gt>I<pattern>)>> for more details;
possessive quantifiers are just syntactic sugar for that construct.  For
instance the above example could also be written as follows:

   /"(?>(?:(?>[^"\\]+)|\\.)*)"/

Note that the possessive quantifier modifier cannot be combined
with the non-greedy modifier, because the combination would make no sense.
Consider the following equivalency table:

    Illegal         Legal
    ------------    ------
    X??+            X{0}
    X+?+            X{1}
    X{min,max}?+    X{min}

=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

 \t          tab                   (HT, TAB)
 \n          newline               (LF, NL)
 \r          return                (CR)
 \f          form feed             (FF)
 \a          alarm (bell)          (BEL)
 \e          escape (think troff)  (ESC)
 \cK         control char          (example: VT)
 \x{}, \x00  character whose ordinal is the given hexadecimal number
 \N{name}    named Unicode character or character sequence
 \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
 \o{}, \000  character whose ordinal is the given octal number
 \l          lowercase next char (think vi)
 \u          uppercase next char (think vi)
 \L          lowercase until \E (think vi)
 \U          uppercase until \E (think vi)
 \Q          quote (disable) pattern metacharacters until \E
 \E          end either case modification or quoted section
             (think vi)

Details are in L<perlop/Quote and Quote-like Operators>.

=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

 Sequence   Note    Description
  [...]
[1] Match a character according to the rules of the 1005 bracketed character class defined by the "...". 1006 Example: [a-z] matches "a" or "b" or "c" ... or "z" 1007 [[:...:]] [2] Match a character according to the rules of the POSIX 1008 character class "..." within the outer bracketed 1009 character class. Example: [[:upper:]] matches any 1010 uppercase character. 1011 (?[...]) [8] Extended bracketed character class 1012 \w [3] Match a "word" character (alphanumeric plus "_", plus 1013 other connector punctuation chars plus Unicode 1014 marks) 1015 \W [3] Match a non-"word" character 1016 \s [3] Match a whitespace character 1017 \S [3] Match a non-whitespace character 1018 \d [3] Match a decimal digit character 1019 \D [3] Match a non-digit character 1020 \pP [3] Match P, named property. Use \p{Prop} for longer names 1021 \PP [3] Match non-P 1022 \X [4] Match Unicode "eXtended grapheme cluster" 1023 \1 [5] Backreference to a specific capture group or buffer. 1024 '1' may actually be any positive integer. 1025 \g1 [5] Backreference to a specific or previous group, 1026 \g{-1} [5] The number may be negative indicating a relative 1027 previous group and may optionally be wrapped in 1028 curly brackets for safer parsing. 1029 \g{name} [5] Named backreference 1030 \k<name> [5] Named backreference 1031 \k'name' [5] Named backreference 1032 \k{name} [5] Named backreference 1033 \K [6] Keep the stuff left of the \K, don't include it in $& 1034 \N [7] Any character but \n. Not affected by /s modifier 1035 \v [3] Vertical whitespace 1036 \V [3] Not vertical whitespace 1037 \h [3] Horizontal whitespace 1038 \H [3] Not horizontal whitespace 1039 \R [4] Linebreak 1040 1041=over 4 1042 1043=item [1] 1044 1045See L<perlrecharclass/Bracketed Character Classes> for details. 1046 1047=item [2] 1048 1049See L<perlrecharclass/POSIX Character Classes> for details. 
1050 1051=item [3] 1052 1053See L<perlunicode/Unicode Character Properties> for details 1054 1055=item [4] 1056 1057See L<perlrebackslash/Misc> for details. 1058 1059=item [5] 1060 1061See L</Capture groups> below for details. 1062 1063=item [6] 1064 1065See L</Extended Patterns> below for details. 1066 1067=item [7] 1068 1069Note that C<\N> has two meanings. When of the form C<\N{I<NAME>}>, it 1070matches the character or character sequence whose name is I<NAME>; and 1071similarly 1072when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode 1073code point is I<hex>. Otherwise it matches any character but C<\n>. 1074 1075=item [8] 1076 1077See L<perlrecharclass/Extended Bracketed Character Classes> for details. 1078 1079=back 1080 1081=head3 Assertions 1082 1083Besides L<C<"^"> and C<"$">|/Metacharacters>, Perl defines the following 1084zero-width assertions: 1085X<zero-width assertion> X<assertion> X<regex, zero-width assertion> 1086X<regexp, zero-width assertion> 1087X<regular expression, zero-width assertion> 1088X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> 1089 1090 \b{} Match at Unicode boundary of specified type 1091 \B{} Match where corresponding \b{} doesn't match 1092 \b Match a \w\W or \W\w boundary 1093 \B Match except at a \w\W or \W\w boundary 1094 \A Match only at beginning of string 1095 \Z Match only at end of string, or before newline at the end 1096 \z Match only at end of string 1097 \G Match only at pos() (e.g. at the end-of-match position 1098 of prior m//g) 1099 1100A Unicode boundary (C<\b{}>), available starting in v5.22, is a spot 1101between two characters, or before the first character in the string, or 1102after the final character in the string where certain criteria defined 1103by Unicode are met. See L<perlrebackslash/\b{}, \b, \B{}, \B> for 1104details. 
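
To make the difference between C<\b> and a Unicode boundary a little more
concrete, here is a small sketch using the word boundary type C<wb>. It
assumes perl v5.22 or later, and the variable names are made up for the
example. A plain C<\b> sees a boundary between the "n" and the apostrophe
of "don't", whereas C<\b{wb}> treats the whole contraction as one word.

```perl
use v5.22;
use warnings;

my $word = "don't";

# \b: "n" is \w and "'" is \W, so there is a boundary between them.
my $plain_boundary = $word =~ /^don\b't$/ ? 1 : 0;

# \b{wb}: Unicode word-break rules do not break at an apostrophe that
# sits between two letters.
my $unicode_boundary = $word =~ /^don\b{wb}'t$/ ? 1 : 0;

say "\\b: $plain_boundary, \\b{wb}: $unicode_boundary";
```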
1105 1106A word boundary (C<\b>) is a spot between two characters 1107that has a C<\w> on one side of it and a C<\W> on the other side 1108of it (in either order), counting the imaginary characters off the 1109beginning and end of the string as matching a C<\W>. (Within 1110character classes C<\b> represents backspace rather than a word 1111boundary, just as it normally does in any double-quoted string.) 1112The C<\A> and C<\Z> are just like C<"^"> and C<"$">, except that they 1113won't match multiple times when the C</m> modifier is used, while 1114C<"^"> and C<"$"> will match at every internal line boundary. To match 1115the actual end of the string and not ignore an optional trailing 1116newline, use C<\z>. 1117X<\b> X<\A> X<\Z> X<\z> X</m> 1118 1119The C<\G> assertion can be used to chain global matches (using 1120C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">. 1121It is also useful when writing C<lex>-like scanners, when you have 1122several patterns that you want to match against consequent substrings 1123of your string; see the previous reference. The actual location 1124where C<\G> will match can also be influenced by using C<pos()> as 1125an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length 1126matches (see L</"Repeated Patterns Matching a Zero-length Substring">) 1127is modified somewhat, in that contents to the left of C<\G> are 1128not counted when determining the length of the match. Thus the following 1129will not match forever: 1130X<\G> 1131 1132 my $string = 'ABC'; 1133 pos($string) = 1; 1134 while ($string =~ /(.\G)/g) { 1135 print $1; 1136 } 1137 1138It will print 'A' and then terminate, as it considers the match to 1139be zero-width, and thus will not match at the same position twice in a 1140row. 1141 1142It is worth noting that C<\G> improperly used can result in an infinite 1143loop. Take care when using patterns that include C<\G> in an alternation. 
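
The C<lex>-like use of C<\G> mentioned above might look something like
the following sketch. The token names and the tiny grammar are invented
for this example. Each pattern begins with C<\G> and is tried with the
C</gc> modifiers, so that a failed attempt leaves C<pos()> where it was
and the next pattern gets its turn at the same position.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

pos($input) = 0;                       # start scanning at the beginning
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)            { next }   # skip white space
    elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID($1)" }
    elsif ($input =~ /\G([=+])/gc)         { push @tokens, "OP($1)" }
    else {
        die "unexpected character at offset " . pos($input) . "\n";
    }
}

print join(" ", @tokens), "\n";    # ID(x) OP(=) NUM(42) OP(+) ID(y)
```

Because every branch either advances C<pos()> by at least one character
or dies, the loop cannot spin forever, which sidesteps the infinite-loop
hazard noted above.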
1144 1145Note also that C<s///> will refuse to overwrite part of a substitution 1146that has already been replaced; so for example this will stop after the 1147first iteration, rather than iterating its way backwards through the 1148string: 1149 1150 $_ = "123456789"; 1151 pos = 6; 1152 s/.(?=.\G)/X/g; 1153 print; # prints 1234X6789, not XXXXX6789 1154 1155 1156=head3 Capture groups 1157 1158The grouping construct C<( ... )> creates capture groups (also referred to as 1159capture buffers). To refer to the current contents of a group later on, within 1160the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>) 1161for the second, and so on. 1162This is called a I<backreference>. 1163X<regex, capture buffer> X<regexp, capture buffer> 1164X<regex, capture group> X<regexp, capture group> 1165X<regular expression, capture buffer> X<backreference> 1166X<regular expression, capture group> X<backreference> 1167X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference> 1168X<named capture buffer> X<regular expression, named capture buffer> 1169X<named capture group> X<regular expression, named capture group> 1170X<%+> X<$+{name}> X<< \k<name> >> 1171There is no limit to the number of captured substrings that you may use. 1172Groups are numbered with the leftmost open parenthesis being number 1, I<etc>. If 1173a group did not match, the associated backreference won't match either. (This 1174can happen if the group is optional, or in a different branch of an 1175alternation.) 1176You can omit the C<"g">, and write C<"\1">, I<etc>, but there are some issues with 1177this form, described below. 1178 1179You can also refer to capture groups relatively, by using a negative number, so 1180that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture 1181group, and C<\g-2> and C<\g{-2}> both refer to the group before it. 
For 1182example: 1183 1184 / 1185 (Y) # group 1 1186 ( # group 2 1187 (X) # group 3 1188 \g{-1} # backref to group 3 1189 \g{-3} # backref to group 1 1190 ) 1191 /x 1192 1193would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to 1194interpolate regexes into larger regexes and not have to worry about the 1195capture groups being renumbered. 1196 1197You can dispense with numbers altogether and create named capture groups. 1198The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to 1199reference. (To be compatible with .Net regular expressions, C<\g{I<name>}> may 1200also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.) 1201I<name> must not begin with a number, nor contain hyphens. 1202When different groups within the same pattern have the same name, any reference 1203to that name assumes the leftmost defined group. Named groups count in 1204absolute and relative numbering, and so can also be referred to by those 1205numbers. 1206(It's possible to do things with named capture groups that would otherwise 1207require C<(??{})>.) 1208 1209Capture group contents are dynamically scoped and available to you outside the 1210pattern until the end of the enclosing block or until the next successful 1211match in the same scope, whichever comes first. 1212See L<perlsyn/"Compound Statements"> and 1213L<perlvar/"Scoping Rules of Regex Variables"> for more details. 1214 1215You can access the contents of a capture group by absolute number (using 1216C<"$1"> instead of C<"\g1">, I<etc>); or by name via the C<%+> hash, 1217using C<"$+{I<name>}">. 1218 1219Braces are required in referring to named capture groups, but are optional for 1220absolute or relative numbered ones. Braces are safer when creating a regex by 1221concatenating smaller strings. For example if you have C<qr/$x$y/>, and C<$x> 1222contained C<"\g1">, and C<$y> contained C<"37">, you would get C</\g137/> which 1223is probably not what you intended. 
1224 1225If you use braces, you may also optionally add any number of blank 1226(space or tab) characters within but adjacent to the braces, like 1227S<C<\g{ -1 }>>, or S<C<\k{ I<name> }>>. 1228 1229The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that 1230there were no named nor relative numbered capture groups. Absolute numbered 1231groups were referred to using C<\1>, 1232C<\2>, I<etc>., and this notation is still 1233accepted (and likely always will be). But it leads to some ambiguities if 1234there are more than 9 capture groups, as C<\10> could mean either the tenth 1235capture group, or the character whose ordinal in octal is 010 (a backspace in 1236ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference 1237only if at least 10 left parentheses have opened before it. Likewise C<\11> is 1238a backreference only if at least 11 left parentheses have opened before it. 1239And so on. C<\1> through C<\9> are always interpreted as backreferences. 1240There are several examples below that illustrate these perils. You can avoid 1241the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups; 1242and for octal constants always using C<\o{}>, or for C<\077> and below, using 3 1243digits padded with leading zeros, since a leading zero implies an octal 1244constant. 1245 1246The C<\I<digit>> notation also works in certain circumstances outside 1247the pattern. See L</Warning on \1 Instead of $1> below for details. 1248 1249Examples: 1250 1251 s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words 1252 1253 /(.)\g1/ # find first doubled char 1254 and print "'$1' is the first doubled character\n"; 1255 1256 /(?<char>.)\k<char>/ # ... a different way 1257 and print "'$+{char}' is the first doubled character\n"; 1258 1259 /(?'char'.)\g1/ # ... 
mix and match 1260 and print "'$1' is the first doubled character\n"; 1261 1262 if (/Time: (..):(..):(..)/) { # parse out values 1263 $hours = $1; 1264 $minutes = $2; 1265 $seconds = $3; 1266 } 1267 1268 /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/ # \g10 is a backreference 1269 /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/ # \10 is octal 1270 /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/ # \10 is a backreference 1271 /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal 1272 1273 $x = '(.)\1'; # Creates problems when concatenated. 1274 $y = '(.)\g{1}'; # Avoids the problems. 1275 "aa" =~ /${x}/; # True 1276 "aa" =~ /${y}/; # True 1277 "aa0" =~ /${x}0/; # False! 1278 "aa0" =~ /${y}0/; # True 1279 "aa\x08" =~ /${x}0/; # True! 1280 "aa\x08" =~ /${y}0/; # False 1281 1282Several special variables also refer back to portions of the previous 1283match. C<$+> returns whatever the last bracket match matched. 1284C<$&> returns the entire matched string. (At one point C<$0> did 1285also, but now it returns the name of the program.) C<$`> returns 1286everything before the matched string. C<$'> returns everything 1287after the matched string. And C<$^N> contains whatever was matched by 1288the most-recently closed group (submatch). C<$^N> can be used in 1289extended patterns (see below), for example to assign a submatch to a 1290variable. 1291X<$+> X<$^N> X<$&> X<$`> X<$'> 1292 1293These special variables, like the C<%+> hash and the numbered match variables 1294(C<$1>, C<$2>, C<$3>, I<etc>.) are dynamically scoped 1295until the end of the enclosing block or until the next successful 1296match, whichever comes first. (See L<perlsyn/"Compound Statements">.) 1297X<$+> X<$^N> X<$&> X<$`> X<$'> 1298X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9> 1299X<@{^CAPTURE}> 1300 1301The C<@{^CAPTURE}> array may be used to access ALL of the capture buffers 1302as an array without needing to know how many there are. 
For instance 1303 1304 $string=~/$pattern/ and @captured = @{^CAPTURE}; 1305 1306will place a copy of each capture variable, C<$1>, C<$2> etc, into the 1307C<@captured> array. 1308 1309Be aware that when interpolating a subscript of the C<@{^CAPTURE}> 1310array you must use demarcated curly brace notation: 1311 1312 print "${^CAPTURE[0]}"; 1313 1314See L<perldata/"Demarcated variable names using braces"> for more on 1315this notation. 1316 1317B<NOTE>: Failed matches in Perl do not reset the match variables, 1318which makes it easier to write code that tests for a series of more 1319specific cases and remembers the best match. 1320 1321B<WARNING>: If your code is to run on Perl 5.16 or earlier, 1322beware that once Perl sees that you need one of C<$&>, C<$`>, or 1323C<$'> anywhere in the program, it has to provide them for every 1324pattern match. This may substantially slow your program. 1325 1326Perl uses the same mechanism to produce C<$1>, C<$2>, I<etc>, so you also 1327pay a price for each pattern that contains capturing parentheses. 1328(To avoid this cost while retaining the grouping behaviour, use the 1329extended regular expression C<(?: ... )> instead.) But if you never 1330use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing 1331parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`> 1332if you can, but if you can't (and some algorithms really appreciate 1333them), once you've used them once, use them at will, because you've 1334already paid the price. 1335X<$&> X<$`> X<$'> 1336 1337Perl 5.16 introduced a slightly more efficient mechanism that notes 1338separately whether each of C<$`>, C<$&>, and C<$'> have been seen, and 1339thus may only need to copy part of the string. Perl 5.20 introduced a 1340much more efficient copy-on-write mechanism which eliminates any slowdown. 
1341 1342As another workaround for this problem, Perl 5.10.0 introduced C<${^PREMATCH}>, 1343C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&> 1344and C<$'>, B<except> that they are only guaranteed to be defined after a 1345successful match that was executed with the C</p> (preserve) modifier. 1346The use of these variables incurs no global performance penalty, unlike 1347their punctuation character equivalents, however at the trade-off that you 1348have to tell perl when you want to use them. 1349X</p> X<p modifier> 1350 1351=head2 Quoting metacharacters 1352 1353Backslashed metacharacters in Perl are alphanumeric, such as C<\b>, 1354C<\w>, C<\n>. Unlike some other regular expression languages, there 1355are no backslashed symbols that aren't alphanumeric. So anything 1356that looks like C<\\>, C<\(>, C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> is 1357always 1358interpreted as a literal character, not a metacharacter. This was 1359once used in a common idiom to disable or quote the special meanings 1360of regular expression metacharacters in a string that you want to 1361use for a pattern. Simply quote all non-"word" characters: 1362 1363 $pattern =~ s/(\W)/\\$1/g; 1364 1365(If C<use locale> is set, then this depends on the current locale.) 1366Today it is more common to use the C<L<quotemeta()|perlfunc/quotemeta>> 1367function or the C<\Q> metaquoting escape sequence to disable all 1368metacharacters' special meanings like this: 1369 1370 /$unquoted\Q$quoted\E$unquoted/ 1371 1372Beware that if you put literal backslashes (those not inside 1373interpolated variables) between C<\Q> and C<\E>, double-quotish 1374backslash interpolation may lead to confusing results. If you 1375I<need> to use literal backslashes within C<\Q...\E>, 1376consult L<perlop/"Gory details of parsing quoted constructs">. 1377 1378C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>. 
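
Here is a short sketch of why quoting matters when a string is meant to
be matched literally. The strings are invented for the example. Without
quoting, the C<"+">, C<"(">, C<")"> and C<"."> in C<$literal> are treated
as metacharacters and the match fails; with C<\Q...\E> (or, equivalently,
C<quotemeta()>) the string matches itself.

```perl
use strict;
use warnings;

my $literal = "1+1=2 (approx.)";
my $target  = "we know 1+1=2 (approx.) holds";

# Interpolated as-is: "+" is a quantifier and the parentheses are a group.
my $unquoted_hit = $target =~ /$literal/ ? 1 : 0;

# Metacharacters disabled by \Q...\E: matches the literal text.
my $quoted_hit = $target =~ /\Q$literal\E/ ? 1 : 0;

# quotemeta() produces the same escaping as a plain string.
my $escaped = quotemeta($literal);
my $function_hit = $target =~ /$escaped/ ? 1 : 0;

print "unquoted: $unquoted_hit, \\Q: $quoted_hit, quotemeta: $function_hit\n";
```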

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>.  The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses.  The character after the question mark indicates
the extension.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology....

=over 4

=item C<(?#I<text>)>
X<(?#)>

A comment.  The I<text> is ignored.
Note that Perl closes
the comment as soon as it sees a C<")">, so there is no way to put a literal
C<")"> in the comment.  The pattern's closing delimiter must be escaped by
a backslash if it appears in the comment.

See L</E<sol>x> for another way to have comments in patterns.

Note that a comment can go just about anywhere, except in the middle of
an escape sequence.   Examples:

 qr/foo(?#comment)bar/          # Matches 'foobar'

 # The pattern below matches 'abcd', 'abccd', or 'abcccd'
 qr/abc(?#comment between literal and its quantifier){1,3}d/

 # The pattern below generates a syntax error, because the '\p' must
 # be followed immediately by a '{'.
1417 qr/\p(?#comment between \p and its property name){Any}/ 1418 1419 # The pattern below generates a syntax error, because the initial 1420 # '\(' is a literal opening parenthesis, and so there is nothing 1421 # for the closing ')' to match 1422 qr/\(?#the backslash means this isn't a comment)p{Any}/ 1423 1424 # Comments can be used to fold long patterns into multiple lines 1425 qr/First part of a long regex(?# 1426 )remaining part/ 1427 1428=item C<(?adlupimnsx-imnsx)> 1429 1430=item C<(?^alupimnsx)> 1431X<(?)> X<(?^)> 1432 1433Zero or more embedded pattern-match modifiers, to be turned on (or 1434turned off if preceded by C<"-">) for the remainder of the pattern or 1435the remainder of the enclosing pattern group (if any). 1436 1437This is particularly useful for dynamically-generated patterns, 1438such as those read in from a 1439configuration file, taken from an argument, or specified in a table 1440somewhere. Consider the case where some patterns want to be 1441case-sensitive and some do not: The case-insensitive ones merely need to 1442include C<(?i)> at the front of the pattern. For example: 1443 1444 $pattern = "foobar"; 1445 if ( /$pattern/i ) { } 1446 1447 # more flexible: 1448 1449 $pattern = "(?i)foobar"; 1450 if ( /$pattern/ ) { } 1451 1452These modifiers are restored at the end of the enclosing group. For example, 1453 1454 ( (?i) blah ) \s+ \g1 1455 1456will match C<blah> in any case, some spaces, and an exact (I<including the case>!) 1457repetition of the previous word, assuming the C</x> modifier, and no C</i> 1458modifier outside this group. 1459 1460These modifiers do not carry over into named subpatterns called in the 1461enclosing group. In other words, a pattern such as C<((?i)(?&I<NAME>))> does not 1462change the case-sensitivity of the I<NAME> pattern. 
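The scoping rule for the C<( (?i) blah ) \s+ \g1> pattern above can be checked with a short sketch. The test strings are made up; the pattern is the one from the text, compiled under C</x>.

```perl
use strict;
use warnings;

# (?i) applies only inside group 1; the backreference after the
# group is matched case-sensitively again.
my $re = qr/ ( (?i) blah ) \s+ \g1 /x;

my $same_case  = 'BLAH BLAH' =~ $re ? 1 : 0;   # repetition has identical case
my $mixed_case = 'BLAH blah' =~ $re ? 1 : 0;   # repetition differs in case

print "$same_case $mixed_case\n";   # 1 0
```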
1463 1464A modifier is overridden by later occurrences of this construct in the 1465same scope containing the same modifier, so that 1466 1467 /((?im)foo(?-m)bar)/ 1468 1469matches all of C<foobar> case insensitively, but uses C</m> rules for 1470only the C<foo> portion. The C<"a"> flag overrides C<aa> as well; 1471likewise C<aa> overrides C<"a">. The same goes for C<"x"> and C<xx>. 1472Hence, in 1473 1474 /(?-x)foo/xx 1475 1476both C</x> and C</xx> are turned off during matching C<foo>. And in 1477 1478 /(?x)foo/x 1479 1480C</x> but NOT C</xx> is turned on for matching C<foo>. (One might 1481mistakenly think that since the inner C<(?x)> is already in the scope of 1482C</x>, that the result would effectively be the sum of them, yielding 1483C</xx>. It doesn't work that way.) Similarly, doing something like 1484C<(?xx-x)foo> turns off all C<"x"> behavior for matching C<foo>, it is not 1485that you subtract 1 C<"x"> from 2 to get 1 C<"x"> remaining. 1486 1487Any of these modifiers can be set to apply globally to all regular 1488expressions compiled within the scope of a C<use re>. See 1489L<re/"'/flags' mode">. 1490 1491Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately 1492after the C<"?"> is a shorthand equivalent to C<d-imnsx>. Flags (except 1493C<"d">) may follow the caret to override it. 1494But a minus sign is not legal with it. 1495 1496Note that the C<"a">, C<"d">, C<"l">, C<"p">, and C<"u"> modifiers are special in 1497that they can only be enabled, not disabled, and the C<"a">, C<"d">, C<"l">, and 1498C<"u"> modifiers are mutually exclusive: specifying one de-specifies the 1499others, and a maximum of one (or two C<"a">'s) may appear in the 1500construct. Thus, for 1501example, C<(?-p)> will warn when compiled under C<use warnings>; 1502C<(?-d:...)> and C<(?dl:...)> are fatal errors. 1503 1504Note also that the C<"p"> modifier is special in that its presence 1505anywhere in a pattern has a global effect. 
1506 1507Having zero modifiers makes this a no-op (so why did you specify it, 1508unless it's generated code), and starting in v5.30, warns under L<C<use 1509re 'strict'>|re/'strict' mode>. 1510 1511=item C<(?:I<pattern>)> 1512X<(?:)> 1513 1514=item C<(?adluimnsx-imnsx:I<pattern>)> 1515 1516=item C<(?^aluimnsx:I<pattern>)> 1517X<(?^:)> 1518 1519This is for clustering, not capturing; it groups subexpressions like 1520C<"()">, but doesn't make backreferences as C<"()"> does. So 1521 1522 @fields = split(/\b(?:a|b|c)\b/) 1523 1524matches the same field delimiters as 1525 1526 @fields = split(/\b(a|b|c)\b/) 1527 1528but doesn't spit out the delimiters themselves as extra fields (even though 1529that's the behaviour of L<perlfunc/split> when its pattern contains capturing 1530groups). It's also cheaper not to capture 1531characters if you don't need to. 1532 1533Any letters between C<"?"> and C<":"> act as flags modifiers as with 1534C<(?adluimnsx-imnsx)>. For example, 1535 1536 /(?s-i:more.*than).*million/i 1537 1538is equivalent to the more verbose 1539 1540 /(?:(?s-i)more.*than).*million/i 1541 1542Note that any C<()> constructs enclosed within this one will still 1543capture unless the C</n> modifier is in effect. 1544 1545Like the L</(?adlupimnsx-imnsx)> construct, C<aa> and C<"a"> override each 1546other, as do C<xx> and C<"x">. They are not additive. So, doing 1547something like C<(?xx-x:foo)> turns off all C<"x"> behavior for matching 1548C<foo>. 1549 1550Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately 1551after the C<"?"> is a shorthand equivalent to C<d-imnsx>. Any positive 1552flags (except C<"d">) may follow the caret, so 1553 1554 (?^x:foo) 1555 1556is equivalent to 1557 1558 (?x-imns:foo) 1559 1560The caret tells Perl that this cluster doesn't inherit the flags of any 1561surrounding pattern, but uses the system defaults (C<d-imnsx>), 1562modified by any flags specified. 
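A minimal sketch of the two points above, assuming Perl 5.14 or later for the caret form. The sample text and variable names are illustrative, and the C<split> patterns are slightly adapted from the ones in the text so that surrounding whitespace is consumed as well.

```perl
use strict;
use warnings;

my $text = 'x a y b z';

# Clustering only: the delimiters are not returned.
my @plain = split /\s*\b(?:a|b)\b\s*/, $text;

# Capturing: split returns the delimiters as extra fields.
my @captured = split /\s*\b(a|b)\b\s*/, $text;

# A plain cluster inherits the outer /i; a caret cluster does not.
my $inherits = 'FOO' =~ /(?:foo)/i  ? 1 : 0;
my $fresh    = 'FOO' =~ /(?^:foo)/i ? 1 : 0;

print "@plain | @captured | $inherits $fresh\n";   # x y z | x a y b z | 1 0
```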
1563 1564The caret allows for simpler stringification of compiled regular 1565expressions. These look like 1566 1567 (?^:pattern) 1568 1569with any non-default flags appearing between the caret and the colon. 1570A test that looks at such stringification thus doesn't need to have the 1571system default flags hard-coded in it, just the caret. If new flags are 1572added to Perl, the meaning of the caret's expansion will change to include 1573the default for those flags, so the test will still work, unchanged. 1574 1575Specifying a negative flag after the caret is an error, as the flag is 1576redundant. 1577 1578Mnemonic for C<(?^...)>: A fresh beginning since the usual use of a caret is 1579to match at the beginning. 1580 1581=item C<(?|I<pattern>)> 1582X<(?|)> X<Branch reset> 1583 1584This is the "branch reset" pattern, which has the special property 1585that the capture groups are numbered from the same starting point 1586in each alternation branch. It is available starting from perl 5.10.0. 1587 1588Capture groups are numbered from left to right, but inside this 1589construct the numbering is restarted for each branch. 1590 1591The numbering within each branch will be as normal, and any groups 1592following this construct will be numbered as though the construct 1593contained only one branch, that being the one with the most capture 1594groups in it. 1595 1596This construct is useful when you want to capture one of a 1597number of alternative matches. 1598 1599Consider the following pattern. The numbers underneath show in 1600which group the captured content will be stored. 1601 1602 1603 # before ---------------branch-reset----------- after 1604 / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x 1605 # 1 2 2 3 2 3 4 1606 1607Be careful when using the branch reset pattern in combination with 1608named captures. 
Named captures are implemented as being aliases to 1609numbered groups holding the captures, and that interferes with the 1610implementation of the branch reset pattern. If you are using named 1611captures in a branch reset pattern, it's best to use the same names, 1612in the same order, in each of the alternations: 1613 1614 /(?| (?<a> x ) (?<b> y ) 1615 | (?<a> z ) (?<b> w )) /x 1616 1617Not doing so may lead to surprises: 1618 1619 "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x; 1620 say $+{a}; # Prints '12' 1621 say $+{b}; # *Also* prints '12'. 1622 1623The problem here is that both the group named C<< a >> and the group 1624named C<< b >> are aliases for the group belonging to C<< $1 >>. 1625 1626=item Lookaround Assertions 1627X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround> 1628 1629Lookaround assertions are zero-width patterns which match a specific 1630pattern without including it in C<$&>. Positive assertions match when 1631their subpattern matches, negative assertions match when their subpattern 1632fails. Lookbehind matches text up to the current match position, 1633lookahead matches text following the current match position. 1634 1635=over 4 1636 1637=item C<(?=I<pattern>)> 1638 1639=item C<(*pla:I<pattern>)> 1640 1641=item C<(*positive_lookahead:I<pattern>)> 1642X<(?=)> 1643X<(*pla> 1644X<(*positive_lookahead> 1645X<look-ahead, positive> X<lookahead, positive> 1646 1647A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/> 1648matches a word followed by a tab, without including the tab in C<$&>. 1649 1650=item C<(?!I<pattern>)> 1651 1652=item C<(*nla:I<pattern>)> 1653 1654=item C<(*negative_lookahead:I<pattern>)> 1655X<(?!)> 1656X<(*nla> 1657X<(*negative_lookahead> 1658X<look-ahead, negative> X<lookahead, negative> 1659 1660A zero-width negative lookahead assertion. For example C</foo(?!bar)/> 1661matches any occurrence of "foo" that isn't followed by "bar". 
Note 1662however that lookahead and lookbehind are NOT the same thing. You cannot 1663use this for lookbehind. 1664 1665If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/> 1666will not do what you want. That's because the C<(?!foo)> is just saying that 1667the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will 1668match. Use lookbehind instead (see below). 1669 1670=item C<(?<=I<pattern>)> 1671 1672=item C<\K> 1673 1674=item C<(*plb:I<pattern>)> 1675 1676=item C<(*positive_lookbehind:I<pattern>)> 1677X<(?<=)> 1678X<(*plb> 1679X<(*positive_lookbehind> 1680X<look-behind, positive> X<lookbehind, positive> X<\K> 1681 1682A zero-width positive lookbehind assertion. For example, C</(?<=\t)\w+/> 1683matches a word that follows a tab, without including the tab in C<$&>. 1684 1685Prior to Perl 5.30, it worked only for fixed-width lookbehind, but 1686starting in that release, it can handle variable lengths from 1 to 255 1687characters as an experimental feature. The feature is enabled 1688automatically if you use a variable length positive lookbehind assertion. 1689 1690In Perl 5.35.10 the scope of the experimental nature of this construct 1691has been reduced, and experimental warnings will only be produced when 1692the construct contains capturing parentheses. The warnings will be 1693raised at pattern compilation time, unless turned off, in the 1694C<experimental::vlb> category. This is to warn you that the exact 1695contents of capturing buffers in a variable length positive lookbehind 1696are not well defined and are subject to change in a future release of perl. 1697 1698Currently if you use capture buffers inside of a positive variable length 1699lookbehind the result will be the longest and thus leftmost match possible.
1700This means that 1701 1702 "aax" =~ /(?=x)(?<=(a|aa))/ 1703 "aax" =~ /(?=x)(?<=(aa|a))/ 1704 "aax" =~ /(?=x)(?<=(a{1,2}?))/ 1705 "aax" =~ /(?=x)(?<=(a{1,2}))/ 1706 1707will all result in C<$1> containing C<"aa">. It is possible that a future 1708release of perl will change this behavior. 1709 1710There is a special form of this construct, called C<\K> 1711(available since Perl 5.10.0), which causes the 1712regex engine to "keep" everything it had matched prior to the C<\K> and 1713not include it in C<$&>. This effectively provides non-experimental 1714variable-length lookbehind of any length. 1715 1716There is also a technique that can be used to handle variable length 1717lookbehinds on earlier releases, and lookbehinds longer than 255 characters. It is 1718described in 1719L<http://www.drregex.com/2019/02/variable-length-lookbehinds-actually.html>. 1720 1721Note that under C</i>, a few single characters match two or three other 1722characters. This makes them variable length, and the 255 length applies 1723to the maximum number of characters in the match. For 1724example C<qr/\N{LATIN SMALL LETTER SHARP S}/i> matches the sequence 1725C<"ss">. Your lookbehind assertion could contain 127 Sharp S 1726characters under C</i>, but adding a 128th would generate a compilation 1727error, as that could match 256 C<"s"> characters in a row. 1728 1729The use of C<\K> inside of another lookaround assertion 1730is allowed, but the behaviour is currently not well defined. 1731 1732For various reasons C<\K> may be significantly more efficient than the 1733equivalent C<< (?<=...) >> construct, and it is especially useful in 1734situations where you want to efficiently remove something following 1735something else in a string. For instance 1736 1737 s/(foo)bar/$1/g; 1738 1739can be rewritten as the much more efficient 1740 1741 s/foo\Kbar//g; 1742 1743Use of the non-greedy modifier C<"?"> may not give you the expected 1744results if it is within a capturing group within the construct.
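The following sketch applies the two substitutions above to the same input, and uses C<\K> after a variable-length prefix. The sample strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $s = 'foobar foobaz foobar';

# Capture-and-reinsert form.
(my $with_capture = $s) =~ s/(foo)bar/$1/g;

# \K form: "foo" is matched but kept out of the replaced text.
(my $with_keep = $s) =~ s/foo\Kbar//g;

# \K after a variable-length prefix: only the digits end up in $&.
my $kept = '';
if ('price:   42' =~ /price:\s*\K\d+/) {
    $kept = $&;
}

print "$with_capture | $with_keep | $kept\n";   # foo foobaz foo | foo foobaz foo | 42
```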
1745 1746=item C<(?<!I<pattern>)> 1747 1748=item C<(*nlb:I<pattern>)> 1749 1750=item C<(*negative_lookbehind:I<pattern>)> 1751X<(?<!)> 1752X<(*nlb> 1753X<(*negative_lookbehind> 1754X<look-behind, negative> X<lookbehind, negative> 1755 1756A zero-width negative lookbehind assertion. For example C</(?<!bar)foo/> 1757matches any occurrence of "foo" that does not follow "bar". 1758 1759Prior to Perl 5.30, it worked only for fixed-width lookbehind, but 1760starting in that release, it can handle variable lengths from 1 to 255 1761characters as an experimental feature. The feature is enabled 1762automatically if you use a variable length negative lookbehind assertion. 1763 1764In Perl 5.35.10 the scope of the experimental nature of this construct 1765has been reduced, and experimental warnings will only be produced when 1766the construct contains capturing parentheses. The warnings will be 1767raised at pattern compilation time, unless turned off, in the 1768C<experimental::vlb> category. This is to warn you that the exact 1769contents of capturing buffers in a variable length negative lookbehind 1770are not well defined and are subject to change in a future release of perl. 1771 1772Currently if you use capture buffers inside of a negative variable length 1773lookbehind the result may not be what you expect, for instance: 1774 1775 say "axfoo"=~/(?=foo)(?<!(a|ax)(?{ say $1 }))/ ? "y" : "n"; 1776 1777will output the following: 1778 1779 a 1780 n 1781 1782which does not make sense as this should print out "ax" as the "a" does 1783not line up at the correct place. Another example would be: 1784 1785 say "yes: '$1-$2'" if "aayfoo"=~/(?=foo)(?<!(a|aa)(a|aa)x)/; 1786 1787will output the following: 1788 1789 yes: 'aa-a' 1790 1791It is possible that a future release of perl will change this behavior 1792so that both of these examples produce more reasonable output.
1793 1794Note that we are confident that the construct will match and reject 1795patterns appropriately, the undefined behavior strictly relates to the 1796value of the capture buffer during or after matching. 1797 1798There is a technique that can be used to handle variable length 1799lookbehind on earlier releases, and longer than 255 characters. It is 1800described in 1801L<http://www.drregex.com/2019/02/variable-length-lookbehinds-actually.html>. 1802 1803Note that under C</i>, a few single characters match two or three other 1804characters. This makes them variable length, and the 255 length applies 1805to the maximum number of characters in the match. For 1806example C<qr/\N{LATIN SMALL LETTER SHARP S}/i> matches the sequence 1807C<"ss">. Your lookbehind assertion could contain 127 Sharp S 1808characters under C</i>, but adding a 128th would generate a compilation 1809error, as that could match 256 C<"s"> characters in a row. 1810 1811Use of the non-greedy modifier C<"?"> may not give you the expected 1812results if it is within a capturing group within the construct. 1813 1814=back 1815 1816=item C<< (?<I<NAME>>I<pattern>) >> 1817 1818=item C<(?'I<NAME>'I<pattern>)> 1819X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture> 1820 1821A named capture group. Identical in every respect to normal capturing 1822parentheses C<()> but for the additional fact that the group 1823can be referred to by name in various regular expression 1824constructs (like C<\g{I<NAME>}>) and can be accessed by name 1825after a successful match via C<%+> or C<%->. See L<perlvar> 1826for more details on the C<%+> and C<%-> hashes. 1827 1828If multiple distinct capture groups have the same name, then 1829C<$+{I<NAME>}> will refer to the leftmost defined group in the match. 1830 1831The forms C<(?'I<NAME>'I<pattern>)> and C<< (?<I<NAME>>I<pattern>) >> 1832are equivalent. 
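A brief sketch of named captures in use. The date string and group names are invented for the example; C<%+> is the hash described above.

```perl
use strict;
use warnings;

my ($year, $same_as_1, $leftmost);

# Named groups are still numbered, so $1 and $+{year} hold the same text.
if ('2024-03-15' =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    $year      = $+{year};
    $same_as_1 = $1 eq $+{year} ? 1 : 0;
}

# Two groups share the name "x"; only the second one takes part in
# this match, so $+{x} refers to it (the leftmost *defined* group).
if ('b' =~ /(?<x>a)|(?<x>b)/) {
    $leftmost = $+{x};
}

print "$year $same_as_1 $leftmost\n";   # 2024 1 b
```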
1833 1834B<NOTE:> While the notation of this construct is the same as the similar 1835function in .NET regexes, the behavior is not. In Perl the groups are 1836numbered sequentially regardless of being named or not. Thus in the 1837pattern 1838 1839 /(x)(?<foo>y)(z)/ 1840 1841C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of 1842the opposite which is what a .NET regex hacker might expect. 1843 1844Currently I<NAME> is restricted to simple identifiers only. 1845In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or 1846its Unicode extension (see L<utf8>), 1847though it isn't extended by the locale (see L<perllocale>). 1848 1849B<NOTE:> In order to make things easier for programmers with experience 1850with the Python or PCRE regex engines, the pattern C<< 1851(?PE<lt>I<NAME>E<gt>I<pattern>) >> 1852may be used instead of C<< (?<I<NAME>>I<pattern>) >>; however this form does not 1853support the use of single quotes as a delimiter for the name. 1854 1855=item C<< \k<I<NAME>> >> 1856 1857=item C<< \k'I<NAME>' >> 1858 1859=item C<< \k{I<NAME>} >> 1860 1861Named backreference. Similar to numeric backreferences, except that 1862the group is designated by name and not number. If multiple groups 1863have the same name then it refers to the leftmost defined group in 1864the current match. 1865 1866It is an error to refer to a name not defined by a C<< (?<I<NAME>>) >> 1867earlier in the pattern. 1868 1869All three forms are equivalent, although with C<< \k{ I<NAME> } >>, 1870you may optionally have blanks within but adjacent to the braces, as 1871shown. 1872 1873B<NOTE:> In order to make things easier for programmers with experience 1874with the Python or PCRE regex engines, the pattern C<< (?P=I<NAME>) >> 1875may be used instead of C<< \k<I<NAME>> >>. 
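As an illustration, a named backreference can be used to find a doubled word. This is only a sketch; the group name C<word> and the sample lines are assumptions, not part of the original text.

```perl
use strict;
use warnings;

# \k<word> must match the same text that the group "word" captured.
my $doubled = qr/\b(?<word>\w+)\s+\k<word>\b/;

my @found;
for my $line ('this is is a test', 'nothing repeated here') {
    push @found, $+{word} if $line =~ $doubled;
}

print "@found\n";   # is
```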
1876 1877=item C<(?{ I<code> })> 1878X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in> 1879 1880B<WARNING>: Using this feature safely requires that you understand its 1881limitations. Code executed that has side effects may not perform identically 1882from version to version due to the effect of future optimisations in the regex 1883engine. For more information on this, see L</Embedded Code Execution 1884Frequency>. 1885 1886This zero-width assertion executes any embedded Perl code. It always 1887succeeds, and its return value is set as C<$^R>. 1888 1889In literal patterns, the code is parsed at the same time as the 1890surrounding code. While within the pattern, control is passed temporarily 1891back to the perl parser, until the logically-balancing closing brace is 1892encountered. This is similar to the way that an array index expression in 1893a literal string is handled, for example 1894 1895 "abc$array[ 1 + f('[') + g()]def" 1896 1897In particular, braces do not need to be balanced: 1898 1899 s/abc(?{ f('{'); })/def/ 1900 1901Even in a pattern that is interpolated and compiled at run-time, literal 1902code blocks will be compiled once, at perl compile time; the following 1903prints "ABCD": 1904 1905 print "D"; 1906 my $qr = qr/(?{ BEGIN { print "A" } })/; 1907 my $foo = "foo"; 1908 /$foo$qr(?{ BEGIN { print "B" } })/; 1909 BEGIN { print "C" } 1910 1911In patterns where the text of the code is derived from run-time 1912information rather than appearing literally in a source code /pattern/, 1913the code is compiled at the same time that the pattern is compiled, and 1914for reasons of security, C<use re 'eval'> must be in scope. This is to 1915stop user-supplied patterns containing code snippets from being 1916executable. 1917 1918In situations where you need to enable this with C<use re 'eval'>, you should 1919also have taint checking enabled, if your perl supports it. 
1920Better yet, use the carefully constrained evaluation within a Safe compartment. 1921See L<perlsec> for details about both these mechanisms. 1922 1923From the viewpoint of parsing, lexical variable scope and closures, 1924 1925 /AAA(?{ BBB })CCC/ 1926 1927behaves approximately like 1928 1929 /AAA/ && do { BBB } && /CCC/ 1930 1931Similarly, 1932 1933 qr/AAA(?{ BBB })CCC/ 1934 1935behaves approximately like 1936 1937 sub { /AAA/ && do { BBB } && /CCC/ } 1938 1939In particular: 1940 1941 { my $i = 1; $r = qr/(?{ print $i })/ } 1942 my $i = 2; 1943 /$r/; # prints "1" 1944 1945Inside a C<(?{...})> block, C<$_> refers to the string the regular 1946expression is matching against. You can also use C<pos()> to know what is 1947the current position of matching within this string. 1948 1949The code block introduces a new scope from the perspective of lexical 1950variable declarations, but B<not> from the perspective of C<local> and 1951similar localizing behaviours. So later code blocks within the same 1952pattern will still see the values which were localized in earlier blocks. 1953These accumulated localizations are undone either at the end of a 1954successful match, or if the assertion is backtracked (compare 1955L</"Backtracking">). For example, 1956 1957 $_ = 'a' x 8; 1958 m< 1959 (?{ $cnt = 0 }) # Initialize $cnt. 1960 ( 1961 a 1962 (?{ 1963 local $cnt = $cnt + 1; # Update $cnt, 1964 # backtracking-safe. 1965 }) 1966 )* 1967 aaaa 1968 (?{ $res = $cnt }) # On success copy to 1969 # non-localized location. 1970 >x; 1971 1972will initially increment C<$cnt> up to 8; then during backtracking, its 1973value will be unwound back to 4, which is the value assigned to C<$res>. 1974At the end of the regex execution, C<$cnt> will be wound back to its initial 1975value of 0. 1976 1977This assertion may be used as the condition in a 1978 1979 (?(condition)yes-pattern|no-pattern) 1980 1981switch. 
If I<not> used in this way, the result of evaluation of I<code> 1982is put into the special variable C<$^R>. This happens immediately, so 1983C<$^R> can be used from other C<(?{ I<code> })> assertions inside the same 1984regular expression. 1985 1986The assignment to C<$^R> above is properly localized, so the old 1987value of C<$^R> is restored if the assertion is backtracked; compare 1988L</"Backtracking">. 1989 1990Note that the special variable C<$^N> is particularly useful with code 1991blocks to capture the results of submatches in variables without having to 1992keep track of the number of nested parentheses. For example: 1993 1994 $_ = "The brown fox jumps over the lazy dog"; 1995 /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i; 1996 print "color = $color, animal = $animal\n"; 1997 1998The use of this construct disables some optimisations globally in the 1999pattern, and the pattern may execute much slower as a consequence. 2000Use a C<*> instead of the C<?> block to create an optimistic form of 2001this construct. C<(*{ ... })> should not disable any optimisations. 2002 2003=item C<(*{ I<code> })> 2004X<(*{})> X<regex, optimistic code> 2005 2006This is *exactly* the same as C<(?{ I<code> })> with the exception 2007that it does not disable B<any> optimisations at all in the regex engine. 2008How often it is executed may vary from perl release to perl release. 2009In a failing match it may not even be executed at all. 2010 2011=item C<(??{ I<code> })> 2012X<(??{})> 2013X<regex, postponed> X<regexp, postponed> X<regular expression, postponed> 2014 2015B<WARNING>: Using this feature safely requires that you understand its 2016limitations. Code executed that has side effects may not perform 2017identically from version to version due to the effect of future 2018optimisations in the regex engine. For more information on this, see 2019L</Embedded Code Execution Frequency>. 2020 2021This is a "postponed" regular subexpression. 
It behaves in I<exactly> the 2022same way as a C<(?{ I<code> })> code block as described above, except that 2023its return value, rather than being assigned to C<$^R>, is treated as a 2024pattern, compiled if it's a string (or used as-is if it's a C<qr//> object), 2025then matched as if it were inserted instead of this construct. 2026 2027During the matching of this sub-pattern, it has its own set of 2028captures which are valid during the sub-match, but are discarded once 2029control returns to the main pattern. For example, the following matches, 2030with the inner pattern capturing "B" and matching "BB", while the outer 2031pattern captures "A": 2032 2033 my $inner = '(.)\1'; 2034 "ABBA" =~ /^(.)(??{ $inner })\1/; 2035 print $1; # prints "A" 2036 2037Note that this means that there is no way for the inner pattern to refer 2038to a capture group defined outside. (The code block itself can use C<$1>, 2039I<etc>., to refer to the enclosing pattern's capture groups.) Thus, although 2040 2041 ('a' x 100)=~/(??{'(.)' x 100})/ 2042 2043I<will> match, it will I<not> set C<$1> on exit. 2044 2045The following pattern matches a parenthesized group: 2046 2047 $re = qr{ 2048 \( 2049 (?: 2050 (?> [^()]+ ) # Non-parens without backtracking 2051 | 2052 (??{ $re }) # Group with matching parens 2053 )* 2054 \) 2055 }x; 2056 2057See also 2058L<C<(?I<PARNO>)>|/(?I<PARNO>) (?-I<PARNO>) (?+I<PARNO>) (?R) (?0)> 2059for a different, more efficient way to accomplish 2060the same task. 2061 2062Executing a postponed regular expression too many times without 2063consuming any input string will also result in a fatal error. The depth 2064at which that happens is compiled into perl, so it can be changed with a 2065custom build. 2066 2067The use of this construct disables some optimisations globally in the pattern, 2068and the pattern may execute much more slowly as a consequence.
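Because the code block can read the enclosing pattern's C<$1>, a postponed subexpression can build a pattern from text matched earlier. The sketch below is a hypothetical length-prefixed format (a digit followed by that many C<"x"> characters); since the code block is written literally in the source, it assumes C<use re 'eval'> is not needed here.

```perl
use strict;
use warnings;

# The block returns a string such as "x{3}", which is then compiled
# and matched at the current position.
my $counted = qr/^(\d)(??{ "x{$1}" })\z/;

my $ok  = '3xxx' =~ $counted ? 1 : 0;   # count agrees with the digit
my $bad = '3xx'  =~ $counted ? 1 : 0;   # one "x" short

print "$ok $bad\n";   # 1 0
```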
2069 2070=item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)> 2071X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)> 2072X<regex, recursive> X<regexp, recursive> X<regular expression, recursive> 2073X<regex, relative recursion> X<GOSUB> X<GOSTART> 2074 2075Recursive subpattern. Treat the contents of a given capture buffer in the 2076current pattern as an independent subpattern and attempt to match it at 2077the current position in the string. Information about capture state from 2078the caller for things like backreferences is available to the subpattern, 2079but capture buffers set by the subpattern are not visible to the caller. 2080 2081Similar to C<(??{ I<code> })> except that it does not involve executing any 2082code or potentially compiling a returned pattern string; instead it treats 2083the part of the current pattern contained within a specified capture group 2084as an independent pattern that must match at the current position. Also 2085different is the treatment of capture buffers, unlike C<(??{ I<code> })> 2086recursive patterns have access to their caller's match state, so one can 2087use backreferences safely. 2088 2089I<PARNO> is a sequence of digits (not starting with 0) whose value reflects 2090the paren-number of the capture group to recurse to. C<(?R)> recurses to 2091the beginning of the whole pattern. C<(?0)> is an alternate syntax for 2092C<(?R)>. If I<PARNO> is preceded by a plus or minus sign then it is assumed 2093to be relative, with negative numbers indicating preceding capture groups 2094and positive ones following. Thus C<(?-1)> refers to the most recently 2095declared group, and C<(?+1)> indicates the next group to be declared. 2096Note that the counting for relative recursion differs from that of 2097relative backreferences, in that with recursion unclosed groups B<are> 2098included. 
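A minimal sketch of absolute and relative recursion, using made-up sample strings. Both patterns recurse into group 1: the second one relies on the rule above that C<(?-1)> counts the still-unclosed group it appears in.

```perl
use strict;
use warnings;

# Group 1 matches one balanced pair of parentheses, recursing into
# itself for nested pairs.
my $absolute = qr/^(\((?:[^()]++|(?1))*\))\z/;
my $relative = qr/^(\((?:[^()]++|(?-1))*\))\z/;

my @samples = ('(a(b)c)', '(a(b)c', '()', '(()');

my @abs_ok = grep { $_ =~ $absolute } @samples;
my @rel_ok = grep { $_ =~ $relative } @samples;

print "@abs_ok | @rel_ok\n";   # (a(b)c) () | (a(b)c) ()
```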
2099 2100The following pattern matches a function C<foo()> which may contain 2101balanced parentheses as the argument. 2102 2103 $re = qr{ ( # paren group 1 (full function) 2104 foo 2105 ( # paren group 2 (parens) 2106 \( 2107 ( # paren group 3 (contents of parens) 2108 (?: 2109 (?> [^()]+ ) # Non-parens without backtracking 2110 | 2111 (?2) # Recurse to start of paren group 2 2112 )* 2113 ) 2114 \) 2115 ) 2116 ) 2117 }x; 2118 2119If the pattern was used as follows 2120 2121 'foo(bar(baz)+baz(bop))'=~/$re/ 2122 and print "\$1 = $1\n", 2123 "\$2 = $2\n", 2124 "\$3 = $3\n"; 2125 2126the output produced should be the following: 2127 2128 $1 = foo(bar(baz)+baz(bop)) 2129 $2 = (bar(baz)+baz(bop)) 2130 $3 = bar(baz)+baz(bop) 2131 2132If there is no corresponding capture group defined, then it is a 2133fatal error. Recursing deeply without consuming any input string will 2134also result in a fatal error. The depth at which that happens is 2135compiled into perl, so it can be changed with a custom build. 2136 2137The following shows how using negative indexing can make it 2138easier to embed recursive patterns inside of a C<qr//> construct 2139for later use: 2140 2141 my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; 2142 if (/foo $parens \s+ \+ \s+ bar $parens/x) { 2143 # do something here... 2144 } 2145 2146B<Note> that this pattern does not behave the same way as the equivalent 2147PCRE or Python construct of the same form. In Perl you can backtrack into 2148a recursed group, in PCRE and Python the recursed into group is treated 2149as atomic. Also, modifiers are resolved at compile time, so constructs 2150like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will 2151be processed. 2152 2153=item C<(?&I<NAME>)> 2154X<(?&NAME)> 2155 2156Recurse to a named subpattern. Identical to C<(?I<PARNO>)> except that the 2157parenthesis to recurse to is determined by name. If multiple parentheses have 2158the same name, then it recurses to the leftmost. 
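The same idea can be written with a named group. In this sketch the group name C<list> and the bracketed sample strings are hypothetical.

```perl
use strict;
use warnings;

# (?&list) recurses into the group named "list" to match nested brackets.
my $nested = qr/^(?<list>\[(?:[^\[\]]++|(?&list))*\])\z/;

my @ok = grep { $_ =~ $nested } ('[a[b]c]', '[a[b]c', '[]');

print "@ok\n";   # [a[b]c] []
```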
2159 2160It is an error to refer to a name that is not declared somewhere in the 2161pattern. 2162 2163B<NOTE:> In order to make things easier for programmers with experience 2164with the Python or PCRE regex engines the pattern C<< (?P>I<NAME>) >> 2165may be used instead of C<< (?&I<NAME>) >>. 2166 2167=item C<(?(I<condition>)I<yes-pattern>|I<no-pattern>)> 2168X<(?()> 2169 2170=item C<(?(I<condition>)I<yes-pattern>)> 2171 2172Conditional expression. Matches I<yes-pattern> if I<condition> yields 2173a true value, matches I<no-pattern> otherwise. A missing pattern always 2174matches. 2175 2176C<(I<condition>)> should be one of: 2177 2178=over 4 2179 2180=item an integer in parentheses 2181 2182(which is valid if the corresponding pair of parentheses 2183matched); 2184 2185=item a lookahead/lookbehind/evaluate zero-width assertion; 2186 2187=item a name in angle brackets or single quotes 2188 2189(which is valid if a group with the given name matched); 2190 2191=item the special symbol C<(R)> 2192 2193(true when evaluated inside of recursion or eval). Additionally the 2194C<"R"> may be 2195followed by a number, (which will be true when evaluated when recursing 2196inside of the appropriate group), or by C<&I<NAME>>, in which case it will 2197be true only when evaluated during recursion in the named group. 2198 2199=back 2200 2201Here's a summary of the possible predicates: 2202 2203=over 4 2204 2205=item C<(1)> C<(2)> ... 2206 2207Checks if the numbered capturing group has matched something. 2208Full syntax: C<< (?(1)then|else) >> 2209 2210=item C<(E<lt>I<NAME>E<gt>)> C<('I<NAME>')> 2211 2212Checks if a group with the given name has matched something. 2213Full syntax: C<< (?(<name>)then|else) >> 2214 2215=item C<(?=...)> C<(?!...)> C<(?<=...)> C<(?<!...)> 2216 2217Checks whether the pattern matches (or does not match, for the C<"!"> 2218variants). 
Full syntax: C<< (?(?=I<lookahead>)I<then>|I<else>) >>

=item C<(?{ I<CODE> })>

Treats the return value of the code block as the condition.
Full syntax: C<< (?(?{ I<CODE> })I<then>|I<else>) >>

Note that use of this construct may globally affect the performance
of the pattern. Consider using C<(*{ I<CODE> })> instead.

=item C<(*{ I<CODE> })>

Treats the return value of the code block as the condition.
Full syntax: C<< (?(*{ I<CODE> })I<then>|I<else>) >>

=item C<(R)>

Checks if the expression has been evaluated inside of recursion.
Full syntax: C<< (?(R)I<then>|I<else>) >>

=item C<(R1)> C<(R2)> ...

Checks if the expression has been evaluated while executing directly
inside of the n-th capture group. This check is the regex equivalent of

  if ((caller(0))[3] eq 'subname') { ... }

In other words, it does not check the full recursion stack.

Full syntax: C<< (?(R1)I<then>|I<else>) >>

=item C<(R&I<NAME>)>

Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&I<NAME>)> to disambiguate). It does not check the full
stack, but only the name of the innermost active recursion.
Full syntax: C<< (?(R&I<name>)I<then>|I<else>) >>

=item C<(DEFINE)>

In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
See below for details.
Full syntax: C<< (?(DEFINE)I<definitions>...) >>

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly enclosed in parentheses
themselves.

A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern.
This allows one to
define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.

It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimizer is not very clever about
handling them.

An example of how this might be used is as follows:

  /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
   (?(DEFINE)
     (?<NAME_PAT>....)
     (?<ADDRESS_PAT>....)
   )/x

Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.

Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;

will output 2, not 1. This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.

=item C<< (?>I<pattern>) >>

=item C<< (*atomic:I<pattern>) >>
X<(?E<gt>pattern)>
X<(*atomic>
X<backtrack> X<backtracking> X<atomic> X<possessive>

An "independent" subexpression, one which matches the substring
that a standalone I<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L</"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.

For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<"a"> at the beginning of string, leaving no C<"a"> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L</"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.

C<< (?>I<pattern>) >> does not disable backtracking altogether once it has
matched. It is still possible to backtrack past the construct, but not
into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".

An effect similar to C<< (?>I<pattern>) >> may be achieved by writing
C<(?=(I<pattern>))\g{-1}>. This matches the same substring as a standalone
I<pattern>, and the following C<\g{-1}> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)

Consider this pattern:

    m{ \(
          (
            [^()]+           # x+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching parentheses
two levels deep or less. However, if there is no such group, it
will take virtually forever on a long string. That's because there
are so many different ways to split a long string into several
substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern.
Consider that the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time. This
exponential performance will make it appear that your program has
hung. However, a tiny change to this pattern

    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
of the time when used on a similar string with 1000000 C<"a">s. Be aware,
however, that, when this construct is followed by a
quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.

On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<"a">s.

The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<"#"> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into C<$1>, one should use either
one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects
the above specification of comments.
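
The difference can be seen on a line whose comment is empty apart from
trailing blanks. The following is a minimal sketch; the variable names
and the sample strings are assumptions made for illustration only.

    use strict;
    use warnings;

    my $blank = "#  ";          # a "#" followed only by two spaces
    my $real  = "#  note";      # a "#", two spaces, then text

    # Non-atomic: [ \t]* gives one space back so that .+ can match.
    my ($loose)  = $blank =~ / \# [ \t]* ( .+ ) /x;

    # Atomic: the blanks stay consumed, so there is no comment text.
    my ($atomic) = $blank =~ / (?> \# [ \t]* ) ( .+ ) /x;

    # Both forms agree when real comment text is present.
    my ($text)   = $real  =~ / (?> \# [ \t]* ) ( .+ ) /x;

    printf "loose=<%s> atomic=<%s> text=<%s>\n",
        $loose, $atomic // 'no match', $text;

Here C<$loose> ends up holding a single space, C<$atomic> is undefined
because the match fails, and C<$text> is C<note>.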

In some literature this construct is called "atomic matching" or
"possessive matching".

Possessive quantifiers are equivalent to putting the item they are applied
to inside of one of these constructs. The following equivalences apply:

    Quantifier Form     Bracketing Form
    ---------------     ---------------
    PAT*+               (?>PAT*)
    PAT++               (?>PAT+)
    PAT?+               (?>PAT?)
    PAT{min,max}+       (?>PAT{min,max})

Nested C<(?E<gt>...)> constructs are not no-ops, even if at first glance
they might seem to be. This is because the nested C<(?E<gt>...)> can
restrict internal backtracking that otherwise might occur. For example,

 "abc" =~ /(?>a[bc]*c)/

matches, but

 "abc" =~ /(?>a(?>[bc]*)c)/

does not.

=item C<(?[ ])>

See L<perlrecharclass/Extended Bracketed Character Classes>.

=back

=head2 Backtracking
X<backtrack> X<backtracking>

NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L</Combining RE Pieces>.

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all non-possessive regular expression quantifiers, namely C<"*">,
C<*?>, C<"+">, C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often
optimized internally, but the general principle outlined here is valid.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking: Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
C<$1> with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in C<$1>, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot. Imagine you'd like to match
everything between "foo" and "bar". Initially, you write something
like this:

    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

  got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar". Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
  got <d is under the >

Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string.
As C<\d*> can match on an empty string, the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky. It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using lookahead assertions and negations, this can all get even
trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {                # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string.
Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not. What's happening is
that you've asked "Is it true that at the start of C<$x>, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time. Now there's indeed something following "AB" that is not
"123". It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in C<$1> must be followed both by a digit
and by something that's not "123". Remember that the lookaheads
are zero-width expressions--they only look, but don't consume any
of the string in their match.
So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions: C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously. The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar. C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.

B<WARNING>: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<"*">'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space. Moreover, these internal optimizations are not
always applicable. For example, if you put C<{0,5}> instead of C<"*">
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see C<L</(?E<gt>pattern)>>).
Note also that
zero-length lookahead/lookbehind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant. For an example
where side-effects of lookahead I<might> have influenced the
following match, see C<L</(?E<gt>pattern)>>.

=head2 Script Runs
X<(*script_run:...)> X<(sr:...)>
X<(*atomic_script_run:...)> X<(asr:...)>

A script run is basically a sequence of characters, all from the same
Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
most places a single word would never be written in multiple scripts,
unless it is a spoofing attack. An infamous example is

 paypal.com

Those letters could all be Latin (as in the example just above), or they
could be all Cyrillic (except for the dot), or they could be a mixture
of the two. In the case of an internet address the C<.com> would be in
Latin, and any Cyrillic ones would cause it to be a mixture, not a
script run. Someone clicking on such a link would not be directed to
the real Paypal website, but an attacker would craft a look-alike one to
attempt to gather sensitive information from the person.

Starting in Perl 5.28, it is easy to detect strings that aren't
script runs. Simply enclose just about any pattern like either of
these:

 (*script_run:pattern)
 (*sr:pattern)

What happens is that after I<pattern> succeeds in matching, it is
subjected to the additional criterion that every character in it must be
from the same script (see exceptions below). If this isn't true,
backtracking occurs until something all in the same script is found that
matches, or all possibilities are exhausted. This can cause a lot of
backtracking, but generally, only malicious input will result in this,
though the slowdown could cause a denial of service attack.
If your
needs permit, it is best to make the pattern atomic to cut down on the
amount of backtracking. This is so likely to be what you want, that
instead of writing this:

 (*script_run:(?>pattern))

you can write either of these:

 (*atomic_script_run:pattern)
 (*asr:pattern)

(See C<L</(?E<gt>I<pattern>)>>.)

In Taiwan, Japan, and Korea, it is common for text to have a mixture of
characters from their native scripts and base Chinese. Perl follows
Unicode's UTS 39 (L<https://unicode.org/reports/tr39/>) Unicode Security
Mechanisms in allowing such mixtures. For example, the Japanese scripts
Katakana and Hiragana are commonly mixed together in practice, along
with some Chinese characters, and hence are treated as being in a single
script run by Perl.

The rules used for matching decimal digits are slightly stricter. Many
scripts have their own sets of digits equivalent to the Western C<0>
through C<9> ones. A few, such as Arabic, have more than one set. For
a string to be considered a script run, all digits in it must come from
the same set of ten, as determined by the first digit encountered.
As an example,

 qr/(*script_run: \d+ \b )/x

guarantees that the digits matched will all be from the same set of 10.
You won't get a look-alike digit from a different script that has a
different value than what it appears to be.

Unicode has three pseudo scripts that are handled specially.

"Unknown" is applied to code points whose meaning has yet to be
determined. Perl currently will match as a script run, any single
character string consisting of one of these code points. But any string
longer than one code point containing one of these will not be
considered a script run.

"Inherited" is applied to characters that modify another, such as an
accent of some type.
These are considered to be in the script of the
master character, and so never cause a script run to not match.

The other one is "Common". This consists of mostly punctuation, emoji,
characters used in mathematics and music, the ASCII digits C<0>
through C<9>, and full-width forms of these digits. These characters
can appear intermixed in text in many of the world's scripts. These
also don't cause a script run to not match. But like other scripts, all
digits in a run must come from the same set of 10.

This construct is non-capturing. You can add parentheses to I<pattern>
to capture, if desired. You will have to do this if you plan to use
L</(*ACCEPT) (*ACCEPT:arg)> and not have it bypass the script run
checking.

The C<Script_Extensions> property as modified by UTS 39
(L<https://unicode.org/reports/tr39/>) is used as the basis for this
feature.

To summarize,

=over 4

=item *

All length 0 or length 1 sequences are script runs.

=item *

A longer sequence is a script run if and only if B<all> of the following
conditions are met:

Z<>

=over

=item 1

No code point in the sequence has the C<Script_Extensions> property of
C<Unknown>.

This currently means that all code points in the sequence have been
assigned by Unicode to be characters that aren't private use nor
surrogate code points.

=item 2

All characters in the sequence come from the Common script and/or the
Inherited script and/or a single other script.

The script of a character is determined by the C<Script_Extensions>
property as modified by UTS 39 (L<https://unicode.org/reports/tr39/>), as
described above.

=item 3

All decimal digits in the sequence come from the same block of 10
consecutive digits.

=back

=back

=head2 Special Backtracking Control Verbs

These special patterns are generally of the form C<(*I<VERB>:I<arg>)>. Unless
otherwise stated the I<arg> argument is optional; in some cases, it is
mandatory.

Any pattern containing a special backtracking verb that allows an argument
has the special behaviour that when executed it sets the current package's
C<$REGERROR> and C<$REGMARK> variables. When doing so the following
rules apply:

On failure, the C<$REGERROR> variable will be set to the I<arg> value of the
verb pattern, if the verb was involved in the failure of the match. If the
I<arg> part of the pattern was omitted, then C<$REGERROR> will be set to the
name of the last C<(*MARK:I<NAME>)> pattern executed, or to TRUE if there was
none. Also, the C<$REGMARK> variable will be set to FALSE.

On a successful match, the C<$REGERROR> variable will be set to FALSE, and
the C<$REGMARK> variable will be set to the name of the last
C<(*MARK:I<NAME>)> pattern executed. See the explanation for the
C<(*MARK:I<NAME>)> verb below for more details.

B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
and most other regex-related variables. They are not local to a scope, nor
readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
They are set in the package containing the code that I<executed> the regex
(rather than the one that compiled it, where those differ). If necessary, you
can use C<local> to localize changes to these variables to a specific scope
before executing a regex.

If a pattern does not contain a special backtracking verb that allows an
argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
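
As a rough sketch of how these variables might be read, the following
labels each branch with C<(*MARK:I<NAME>)> and inspects C<$REGMARK>
afterwards. The mark names C<feline> and C<canine> and the array
C<@seen> are invented for the example.

    use strict;
    use warnings;

    # Package variables, declared here so that "use strict" is satisfied.
    our ($REGMARK, $REGERROR);

    my @seen;
    for my $word (qw(cat dog emu)) {
        # Localize so that values from an earlier match do not linger.
        local ($REGMARK, $REGERROR);
        if ( $word =~ /^(?: cat (*MARK:feline) | dog (*MARK:canine) )$/x ) {
            push @seen, "$word=$REGMARK";
        }
        else {
            push @seen, "$word=no match";
        }
    }
    print join(", ", @seen), "\n";

With this input the list reads C<cat=feline>, C<dog=canine>,
C<emu=no match>.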

=over 3

=item Verbs

=over 4

=item C<(*PRUNE)> C<(*PRUNE:I<NAME>)>
X<(*PRUNE)> X<(*PRUNE:NAME)>

This zero-width pattern prunes the backtracking tree at the current point
when backtracked into on failure. Consider the pattern C</I<A> (*PRUNE) I<B>/>,
where I<A> and I<B> are complex patterns. Until the C<(*PRUNE)> verb is reached,
I<A> may backtrack as necessary to match. Once it is reached, matching
continues in I<B>, which may also backtrack as necessary; however, should I<B>
not match, then no further backtracking will take place, and the pattern
will fail outright at the current starting position.

The following example counts all the possible matching strings in a
pattern (without actually matching any of them).

    'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

which produces:

    aaab
    aaa
    aa
    a
    aab
    aa
    a
    ab
    a
    Count=9

If we add a C<(*PRUNE)> before the count like the following

    'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

we prevent backtracking and find the count of the longest matching string
at each matching starting point like so:

    aaab
    aab
    ab
    Count=3

Any number of C<(*PRUNE)> assertions may be used in a pattern.

See also C<<< L<< /(?>I<pattern>) >> >>> and possessive quantifiers for
other ways to
control backtracking. In some cases, the use of C<(*PRUNE)> can be
replaced with a C<< (?>pattern) >> with no functional difference; however,
C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
C<< (?>pattern) >> alone.

=item C<(*SKIP)> C<(*SKIP:I<NAME>)>
X<(*SKIP)>

This zero-width pattern is similar to C<(*PRUNE)>, except that on
failure it also signifies that whatever text that was matched leading up
to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
of this pattern. This effectively means that the regex engine "skips" forward
to this position on failure and tries to match again (assuming that
there is sufficient room to match).

The name of the C<(*SKIP:I<NAME>)> pattern has special significance. If a
C<(*MARK:I<NAME>)> was encountered while matching, then it is that position
which is used as the "skip point". If no C<(*MARK)> of that name was
encountered, then the C<(*SKIP)> operator has no effect. When used
without a name the "skip point" is where the match point was when
executing the C<(*SKIP)> pattern.

Compare the following to the examples in C<(*PRUNE)>; note the string
is twice as long:

    'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

outputs

    aaab
    aaab
    Count=2

Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
executed, the next starting point will be where the cursor was when the
C<(*SKIP)> was executed.

=item C<(*MARK:I<NAME>)> C<(*:I<NAME>)>
X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>

This zero-width pattern can be used to mark the point reached in a string
when a certain part of the pattern has been successfully matched. This
mark may be given a name. A later C<(*SKIP)> pattern will then skip
forward to that point if backtracked into on failure. Any number of
C<(*MARK)> patterns are allowed, and the I<NAME> portion may be duplicated.

In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:I<NAME>)>
can be used to "label" a pattern branch, so that after matching, the
program can determine which branches of the pattern were involved in the
match.

When a match is successful, the C<$REGMARK> variable will be set to the
name of the most recently executed C<(*MARK:I<NAME>)> that was involved
in the match.

This can be used to determine which branch of a pattern was matched
without using a separate capture group for each branch, which in turn
can result in a performance improvement, as perl cannot optimize
C</(?:(x)|(y)|(z))/> as efficiently as something like
C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.

When a match has failed, and unless another verb has been involved in
failing the match and has provided its own name to use, the C<$REGERROR>
variable will be set to the name of the most recently executed
C<(*MARK:I<NAME>)>.

See L</(*SKIP)> for more details.

As a shortcut C<(*MARK:I<NAME>)> can be written C<(*:I<NAME>)>.

=item C<(*THEN)> C<(*THEN:I<NAME>)>

This is similar to the "cut group" operator C<::> from Raku. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
innermost enclosing group (capturing or otherwise) that has alternations.
The two branches of a C<(?(I<condition>)I<yes-pattern>|I<no-pattern>)> do not
count as an alternation, as far as C<(*THEN)> is concerned.

Its name comes from the observation that this operation combined with the
alternation operator (C<"|">) can be used to create what is essentially a
pattern-based if/then/else block:

  ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )

Note that if this operator is used and NOT inside of an alternation then
it acts exactly like the C<(*PRUNE)> operator.

  / A (*PRUNE) B /

is the same as

  / A (*THEN) B /

but

  / ( A (*THEN) B | C ) /

is not the same as

  / ( A (*PRUNE) B | C ) /

as after matching the I<A> but failing on the I<B> the C<(*THEN)> verb will
backtrack and try I<C>; but the C<(*PRUNE)> verb will simply fail.

=item C<(*COMMIT)> C<(*COMMIT:I<arg>)>
X<(*COMMIT)>

This is the Raku "commit pattern" C<< <commit> >> or C<:::>. It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright. No further attempts
to find a valid match by advancing the start pointer will occur again.
For example,

    'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

outputs

    aaab
    Count=1

In other words, once the C<(*COMMIT)> has been entered, and if the pattern
does not match, the regex engine will not try any further matching on the
rest of the string.

=item C<(*FAIL)> C<(*F)> C<(*FAIL:I<arg>)>
X<(*FAIL)> X<(*F)>

This pattern matches nothing and always fails. It can be used to force the
engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
fact, C<(?!)> gets optimised into C<(*FAIL)> internally. You can provide
an argument so that if the match fails because of this C<FAIL> directive
the argument can be obtained from C<$REGERROR>.

It is probably useful only when combined with C<(?{})> or C<(??{})>.
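
For instance, the following hedged sketch uses C<(*FAIL)> together with a
code block to collect every substring that C<\w+> can match, rather than
just the first one. The array name C<@found> is an assumption of the
example.

    use strict;
    use warnings;

    our @found;

    # The code block records the text matched so far; (*FAIL) then
    # forces backtracking, so every alternative gets visited.
    'abc' =~ / \w+ (?{ push @found, $& }) (*FAIL) /x;

    print join(" ", @found), "\n";

The overall match fails, as it always does with a trailing C<(*FAIL)>,
but C<@found> holds C<abc ab a bc b c>.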

=item C<(*ACCEPT)> C<(*ACCEPT:I<arg>)>
X<(*ACCEPT)>

This pattern matches nothing and causes the end of successful matching at
the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
whether there is actually more to match in the string. When inside of a
nested pattern, such as recursion, or in a subpattern dynamically generated
via C<(??{})>, only the innermost pattern is ended immediately.

If the C<(*ACCEPT)> is inside of capturing groups then the groups are
marked as ended at the point at which the C<(*ACCEPT)> was encountered.
For instance:

    'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;

will match; C<$1> will be C<AB>, C<$2> will be C<"B">, and C<$3> will not
be set. If another branch in the inner parentheses was matched, such as in the
string 'ACDE', then the C<"D"> and C<"E"> would have to be matched as well.

You can provide an argument, which will be available in the variable
C<$REGMARK> after the match completes.

=back

=back

=head2 Warning on C<\1> Instead of C<$1>

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitution to avoid
shocking the B<sed> addicts, but it's a dirty habit to get into. That's
because in PerlThink, the right-hand side of an C<s///> is a double-quoted
string. C<\1> in the usual double-quoted string means a control-A. The
customary Unix meaning of C<\1> is kludged in for C<s///>. However, if you
get into the habit of doing that, you get yourself into trouble if you then
add an C</e> modifier.

    s/(\d+)/ \1 + 1 /eg;            # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>.
The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

=head2 Repeated Patterns Matching a Zero-length Substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> matches at the beginning of "C<foo>", and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<"*"> quantifier. Another common way to create a similar cycle
is with the looping modifier C</g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by C<split()>.

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here are two simple examples:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or C<split()> operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring.
Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;

For example, this program

    #!perl -l
    "aaaaab" =~ /
      (?:
         a                 # non-zero
         |                 # or
        (?{print "hello"}) # print hello whenever this
                           #    branch is tried
        (?=(b))            # zero-width assertion
      )*  # any number of times
     /x;
    print $&;
    print $1;

prints

    hello
    aaaaa
    b

Notice that "hello" is only printed once, as when Perl sees that the sixth
iteration of the outermost C<(?:)*> matches a zero-length string, it stops
the C<"*">.

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the match
that follows a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L</"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to C<pos()>.
Zero-length matches at the end of the previous match are ignored
during C<split>.
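
The higher-level rules can be observed directly. The sketch below is
illustrative rather than definitive; the variable names C<$text>,
C<$marked>, C<@pieces> and C<@chars> are assumptions made for this
example.

    # Sketch only: all variable names are illustrative.
    my $text = 'bar';
    (my $marked = $text) =~ s/\w??/<$&>/g;
    print "$marked\n";                    # <><b><><a><><r><>

    # With /g in list context, an empty match may not directly
    # follow another empty match at the same position.
    my @pieces = ( 'foo' =~ m{ o? }xg );
    print join('|', map { "<$_>" } @pieces), "\n";  # <>|<o>|<o>|<>

    # split with an empty pattern breaks the string into characters.
    my @chars = split //, $text;
    print "@chars\n";                     # b a r

In the second statement the empty match at the start of C<"foo"> is
followed by the two one-character matches, and a final empty match is
still allowed at the end of the string because the match before it was
not empty.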

=head2 Combining RE Pieces

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> I<etc>.
(In these examples C<"S"> and C<"T"> are regular subexpressions.)

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L</"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<"S"> and C<"T"> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<"A"> and C<A'> are
substrings which can be matched by C<"S">, and C<"B"> and C<B'> are
substrings which can be matched by C<"T">.

If C<"A"> is a better match for C<"S"> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<"A"> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<"B"> is a better match for C<"T"> than C<B'>.

=item C<S|T>

When C<"S"> can match, it is a better match than when only C<"T"> can match.

The ordering of two matches for C<"S"> is the same as for C<"S"> alone.
Similarly for two matches for C<"T">.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<"S"> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<"S"> is considered. (This is important only if
C<"S"> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<"S"> can match is important.

=item C<(??{ I<EXPR> })>, C<(?I<PARNO>)>

The ordering is the same as for the regular expression which is
the result of I<EXPR>, or the pattern contained by capture group I<PARNO>.

=item C<(?(I<condition>)I<yes-pattern>|I<no-pattern>)>

Recall that which of I<yes-pattern> or I<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

=head2 Creating Custom RE Engines

As of Perl 5.10.0, one can create custom regular expression engines. This
is not for the faint of heart, as they have to plug in at the C level. See
L<perlreapi> for more details.

As an alternative, overloaded constants (see L<overload>) provide a simple
way to extend the functionality of the RE engine, by substituting one
pattern for another.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, I<i.e.>, those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions.
For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside C<$re>):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

=head2 Embedded Code Execution Frequency

The exact rules for how often C<(?{})> and C<(??{})> are executed in a pattern
are unspecified, and this is even more true of C<(*{})>.
In the case of a successful match you can assume that they DWIM and
will be executed in left to right order the appropriate number of times in the
accepting path of the pattern as would any other meta-pattern. How
non-accepting pathways and match failures affect the number of times a
pattern is executed is deliberately unspecified; it may vary depending on
what optimizations can be applied to the pattern, and is likely to change
from version to version.

For instance in

    "aaabcdeeeee"=~/a(?{print "a"})b(?{print "b"})cde/;

the exact number of times "a" or "b" are printed out is unspecified for
failure, but you may assume they will be printed at least once during
a successful match. Additionally, you may assume that if "b" is printed,
it will be preceded by at least one "a".

In the case of branching constructs like the following:

    /a(b|(?{ print "a" }))c(?{ print "c" })/;

you can assume that the input "ac" will output "ac", and that "abc"
will output only "c".

When embedded code is quantified, successful matches will call the
code once for each matched iteration of the quantifier. For
example:

    "good" =~ /g(?:o(?{print "o"}))*d/;

will output "o" twice.

For historical and consistency reasons the use of normal code blocks
anywhere in a pattern will disable certain optimisations. As of 5.37.7
you can use an "optimistic" codeblock, C<(*{ ...
})> as a replacement
for C<(?{ ... })>, if you do I<not> wish to disable these optimisations.
This may result in the code block being called less often than it would
have been had it not been optimistic.

=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:

=over 4

=item C<< (?PE<lt>I<NAME>E<gt>I<pattern>) >>

Define a named capture group. Equivalent to C<< (?<I<NAME>>I<pattern>) >>.

=item C<< (?P=I<NAME>) >>

Backreference to a named capture group. Equivalent to C<< \g{I<NAME>} >>.

=item C<< (?P>I<NAME>) >>

Subroutine call to a named capture group. Equivalent to C<< (?&I<NAME>) >>.

=back

=head1 BUGS

There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<"i"> under L</Modifiers> above.

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

The syntax of patterns used in Perl pattern matching evolved from those
supplied in the Bell Labs Research Unix 8th Edition (Version 8) regex
routines. (The code is actually derived (distantly) from Henry
Spencer's freely redistributable reimplementation of those V8 routines.)

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.
