=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples.
The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting-edge innovations.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this Perl statement all about? C<"Hello World"> is a simple
double-quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells Perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme.
The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., the quote (C<">) is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in this regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match.
The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, Perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called I<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!;  # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by I<escape sequences>.
Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell (or alert). If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)        # matches
    "1000\n2000" =~ /0\n20/        # matches
    "1000\t2000" =~ /\000\t2/      # doesn't match, "0" ne "\000"
    "cat" =~ /\o{143}\x61\x74/     # matches in ASCII, but a weird way
                                   # to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;>> saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >>> loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;>> prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the I<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    ""       =~ /^$/;          # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't.
Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string comparison S<C<$string eq 'bert'>> and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to represent not just a single character sequence, but a I<whole
class> of them.

One such concept is that of a I<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. You can define
your own custom character classes. These
are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                         # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a I<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are C<-]\^$> (and
the pattern delimiter, whatever it is).
C<]> is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a Perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

The special character C<^> in the first position of a character class
denotes a I<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes, as shown below.
Since the introduction of Unicode, unless the C<//a> modifier is in
effect, these character classes match more than just a few characters in
the ASCII range.

=over 4

=item *

\d matches a digit, not just [0-9] but also digits from non-roman scripts

=item *

\s matches a whitespace character, the set [\ \t\r\n\f] and others

=item *

\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_]
but also digits and characters from non-roman scripts

=item *

\D is a negated \d; it represents any other character than a digit, or [^\d]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.'
matches any character but "\n" (unless the modifier C<//s> is
in effect, as explained below).

=item *

\N, like the period, matches any character but "\n", but it does so
regardless of whether the modifier C<//s> is in effect.

=back

The C<//a> modifier, available starting in Perl 5.14, is used to
restrict the matches of \d, \s, and \w to just those in the ASCII range.
It is useful to keep your program from being needlessly exposed to full
Unicode (and its accompanying security considerations) when all you want
is to process English-like text. (The "a" may be doubled, C<//aa>, to
provide even more restrictions, preventing case-insensitive matching of
ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign"
would caselessly match a "k" or "K".)

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of bracketed character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.

In actuality, the period and C<\d\s\w\D\S\W> abbreviations are
themselves types of character classes, so the ones surrounded by
brackets are just one type of character class. When we need to make a
distinction, we refer to them as "bracketed character classes."

An anchor useful in basic regexps is the I<word anchor>
C<\b>.
This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;     # matches cat in 'housecat'
    $x =~ /\bcat/;   # matches cat in 'catenates'
    $x =~ /cat\b/;   # matches cat in 'housecat'
    $x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, $ anchors before "\n"

    ""   =~ /./;     # doesn't match; it needs a char
    ""   =~ /^.$/;   # doesn't match; it needs a char
    "\n" =~ /^.$/;   # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;   # matches
    "a\n" =~ /^.$/;  # matches, $ anchors before "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior.
C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful.
If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of the string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the I<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, Perl will try to match the
regexp at the earliest possible point in the string. At each
character position, Perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, Perl will then try the
next alternative, C<cat>. If C<cat> doesn't match either, then the
match fails and Perl moves to the next position in the string. Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches.
If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice. It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The I<grouping> metacharacters C<()> solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses. Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So Perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

The process of trying one alternative, seeing if it matches, and
moving on to the next alternative, while going back in the string
from where the previous alternative was tried, if it doesn't, is called
I<backtracking>. The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some which are dead ends. When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail. If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what Perl does when it tries to match the regexp

    "abcde" =~ /(abd|abc)(df|d|de)/;

=over 4

=item Z<>0

Start with the first letter in the string 'a'.

=item Z<>1

Try the first alternative in the first group 'abd'.

=item Z<>2

Match 'a' followed by 'b'. So far so good.

=item Z<>3

'd' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.

=item Z<>4

Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.

=item Z<>5

Move on to the second group and pick the first alternative
'df'.

=item Z<>6

Match the 'd'.

=item Z<>7

'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.

=item Z<>8

'd' matches. The second grouping is satisfied, so set $2 to
'd'.

=item Z<>9

We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".

=back

There are a couple of things to note about this analysis. First, the
third alternative in the second group 'de' also allows a match, but we
stopped before we got to it - at a given character position, leftmost
wins. Second, we were able to get a match at the first character
position of the string 'a'. If there were no matches at the first
position, Perl would move to the second character position 'b' and
attempt the match all over again. Only when all possible paths at all
possible character positions have been exhausted does Perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.

Even with all this work, regexp matching happens remarkably fast. To
speed things up, Perl compiles the regexp into a compact sequence of
opcodes that can often fit inside a processor cache. When the code is
executed, these opcodes can then run at full throttle and search very
quickly.
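
The outcome of the step-by-step analysis above can be checked with a
short script. This is only a minimal sketch: the variable names
C<$string>, C<$first>, C<$second>, C<$matched> and C<$century> are
chosen here for illustration, and C<$&> (the text of the whole match)
together with C<$1> and C<$2> is used ahead of their formal
introduction in this tutorial. The second half repeats the
null-alternative example with C<"20">.

```perl
use strict;
use warnings;

# Illustrative names only; the patterns are the ones discussed above.
my $string = "abcde";
my ($first, $second, $matched);

if ($string =~ /(abd|abc)(df|d|de)/) {
    # Copy the match variables right away; a later successful match
    # would overwrite them.
    ($first, $second, $matched) = ($1, $2, $&);
    print "group 1 = '$first', group 2 = '$second', ",
          "whole match = '$matched'\n";
}

# The null alternative: '19' does not match at all, and '20' leaves no
# digits for \d\d, so the empty third alternative is the one that lets
# the whole regexp succeed.
my $century;
if ("20" =~ /(19|20|)\d\d/) {
    $century = $1;
    print "century group = '$century'\n";
}
```

If the analysis is right, the first group holds 'abc', the second
holds 'd' (not 'de', because leftmost wins), and the group in the
second regexp is defined but empty.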

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched. This is very useful to find out what matched and for
text processing in general. For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, etc. They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false
value. In list context, however, it returns the list of matched values
C<($1,$2,$3)>. So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. Here is a regexp with nested groups:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

If this regexp matches, C<$1> contains a string starting with
C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either
C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>,
or it remains undefined.

For convenience, Perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>,... most-recently assigned; i.e. the C<$1>,
C<$2>,... associated with the rightmost closing parenthesis used in the
match).


=head2 Backreferences

Closely associated with the matching variables C<$1>, C<$2>, ... are
the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply
matching variables that can be used I<inside> a regexp.
This is a
really nice feature; what matches later in a regexp is made to depend on
what matched earlier in the regexp. Suppose we wanted to look
for doubled words in a text, like 'the the'. The following regexp finds
all 3-letter doubles with a space in between:

    /\b(\w\w\w)\s\g1\b/;

The grouping assigns a value to C<\g1>, so that the same 3-letter sequence
is used for both parts.

A similar task is to find words consisting of two identical parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc., and uses C<\g1> to look for
a repeat. Although C<$1> and C<\g1> represent the same thing, care should be
taken to use matching variables C<$1>, C<$2>,... only I<outside> a regexp
and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
so may lead to surprising and unsatisfactory results.


=head2 Relative backreferences

Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group. A more convenient technique became available
with Perl 5.10: relative backreferences. To refer to the immediately
preceding capture group one now may write C<\g{-1}>, the next but
last is available via C<\g{-2}>, and so on.

Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:

    $a99a = '([a-z])(\d)\g2\g1';   # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
        print "$1 is valid\n";
    } else {
        print "bad line: '$line'\n";
    }

But this doesn't match, at least not the way one might expect. Only
after inserting the interpolated C<$a99a> and looking at the resulting
full text of the regexp is it obvious that the backreferences have
backfired. The subexpression C<(\w+)> has snatched number 1 and
demoted the groups in C<$a99a> by one rank. This can be avoided by
using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated


=head2 Named backreferences

Perl 5.10 also introduced named capture groups and named backreferences.
To attach a name to a capturing group, you write either
C<< (?<name>...) >> or C<< (?'name'...) >>. The backreference may
then be written as C<\g{name}>. It is permissible to attach the
same name to more than one group, but then only the leftmost one of the
eponymous set can be referenced. Outside of the pattern a named
capture group is accessible through the C<%+> hash.

Assuming that we have to match calendar dates which may be given in one
of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
three suitable patterns where we use 'd', 'm' and 'y' respectively as the
names of the groups capturing the pertaining components of a date. The
matching operation combines the three patterns as alternatives:

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

If any of the alternatives matches, the hash C<%+> is bound to contain the
three key-value pairs.
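
A named group can also be referenced from inside the same regexp. The
following sketch revisits the doubled-word search from the
Backreferences section using C<\g{name}>; the group name C<word> and
the variables C<$text> and C<$doubled> are illustrative choices, not
part of the examples above.

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";
my $doubled;

# (?<word>...) names the group; \g{word} refers back to it inside the
# regexp, and $+{word} reads the captured text afterwards.
if ($text =~ /\b(?<word>\w+)\s+\g{word}\b/) {
    $doubled = $+{word};
    print "doubled word: $doubled\n";
}
```

Because the group is referenced by name, the regexp keeps working if
more capturing groups are later added in front of it.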


=head2 Alternative capture group numbering

Yet another capturing group numbering technique (also available as of
Perl 5.10) deals with the problem of referring to groups within a set of
alternatives.
Consider a pattern for matching a time of the day, civil or military style:

    if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
        # process hour and minute
    }

Processing the results requires an additional if statement to determine
whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would
be easier if we could use group numbers 1 and 2 in the second alternative
as well, and this is exactly what the parenthesized construct C<(?|...)>,
set around an alternative, achieves. Here is an extended version of the
previous pattern:

    if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
        print "hour=$1 minute=$2 zone=$3\n";
    }

Within the alternative numbering group, group numbers start at the same
position for each alternative. After the group, numbering continues
with one higher than the maximum reached across all the alternatives.

=head2 Position information

In addition to what was matched, Perl also provides the
positions of what was matched as contents of the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>.
Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $exp (1..$#-) {
        print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)

Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string. If you use them, Perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match. An example:

    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'

In the second match, C<$`> equals C<''> because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'.

If your code is to run on Perl versions earlier than
5.20, it is worthwhile to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, while C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
If you need to extract the corresponding substrings, use C<@-> and
C<@+> instead:

    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )

As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}>
variables may be used. These are only set if the C</p> modifier is
present. Consequently they do not penalize the rest of the program.
In
Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available
whether the C</p> modifier has been used or not (the modifier is ignored),
and C<$`>, C<$'> and C<$&> do not cause any speed difference.

=head2 Non-capturing groupings

A group that is required to bundle a set of alternatives may or may not be
useful as a capturing group. If it isn't, it just creates a superfluous
addition to the set of available capture group values, inside as well as
outside the regexp. Non-capturing groupings, denoted by C<(?:regexp)>,
still allow the regexp to be treated as a single unit, but don't establish
a capturing group at the same time. Both capturing and non-capturing
groupings are allowed to co-exist in the same regexp. Because there is
no extraction, non-capturing groupings are faster than capturing
groupings. Non-capturing groupings are also handy for choosing exactly
which parts of a regexp are to be extracted to matching variables:

    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation where parentheses are
required for some reason:

    $x = '12aba34ba5';
    @num = split /(a|b)+/, $x;    # @num = ('12','a','34','a','5')
    @num = split /(?:a|b)+/, $x;  # @num = ('12','34','5')


=head2 Matching repetitions

The examples in the previous section display an annoying weakness. We
were only matching 3-letter words, or chunks of words of 4 letters or
less.
We'd like to be able to match words or, more generally, strings
of any length, without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the I<quantifier> metacharacters C<?>,
C<*>, C<+>, and C<{}> were created for. They allow us to delimit the
number of repeats for a portion of a regexp we consider to be a
match. Quantifiers are put immediately after the character, character
class, or grouping that we want to specify. They have the following
meanings:

=over 4

=item *

C<a?> means: match 'a' 1 or 0 times

=item *

C<a*> means: match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> means: match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> means: match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> means: match at least C<n> times

=item *

C<a{n}> means: match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least one space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
    $year =~ /^\d{2}(\d{2})?$/; # same thing written differently.
                                # However, this captures the last two
                                # digits in $1 and the other does not.

    % simple_grep '^(\w+)\g1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa

For all of these quantifiers, Perl will try to match as much of the
string as possible, while still allowing the regexp to succeed.
Thus
with C</a?.../>, Perl will first try to match the regexp with the C<a>
present; if that fails, Perl will try to match the regexp without the
C<a> present. For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

This is what we might expect: the match finds the only C<cat> in the
string and locks onto it. Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters match)

One might initially guess that Perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
this example, that means matching the C<at> element against the final
C<at> in the string. The other important principle illustrated here is that,
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps. Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string. Quantifiers that
grab as much of the string as possible are called I<maximal match> or
I<greedy> quantifiers.

When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match. The next
leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match. And so on, until all the regexp elements are
satisfied.

=back

As we have seen above, Principle 0 overrides the others. The regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.

Here is an example of these principles in action:

    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
                              # $1 = 'The programming republic of Pe'
                              # $2 = 'r'
                              # $3 = 'l'

This regexp matches at the earliest string position, C<'T'>. One
might think that C<e>, being leftmost in the alternation, would be
matched, but C<r> produces the longest string in the first quantifier.

    $x =~ /(m{1,2})(.*)$/;  # matches,
                            # $1 = 'mm'
                            # $2 = 'ing republic of Perl'

Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.

    $x =~ /.*(m{1,2})(.*)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ing republic of Perl'

Here, the regexp matches at the start of the string.
The first
quantifier C<.*> grabs as much as possible, leaving just a single
C<'m'> for the second quantifier C<m{1,2}>.

    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'

Here, C<.?> eats its maximal one character at the earliest possible
position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
the opportunity to match both C<m>'s. Finally,

    "aXXXb" =~ /(X*)/; # matches with $1 = ''

because it can match zero copies of C<'X'> at the beginning of the
string. If you definitely want to match at least one C<'X'>, use
C<X+>, not C<X*>.

Sometimes greed is not good. At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece. For
this purpose, Larry Wall created the I<minimal match> or
I<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
the usual quantifiers with a C<?> appended to them. They have the
following meanings:

=over 4

=item *

C<a??> means: match 'a' 0 or 1 times. Try 0 first, then 1.

=item *

C<a*?> means: match 'a' 0 or more times, i.e., any number of times,
but as few times as possible

=item *

C<a+?> means: match 'a' 1 or more times, i.e., at least once, but
as few times as possible

=item *

C<a{n,m}?> means: match at least C<n> times, not more than C<m>
times, as few times as possible

=item *

C<a{n,}?> means: match at least C<n> times, but as few times as
possible

=item *

C<a{n}?> means: match exactly C<n> times. Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.

=back

Let's look at the example above, but with minimal quantifiers:

    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string C<^>
and the alternation to match is C<Th>, with the alternation C<e|r>
matching C<e>. The second quantifier C<.*> is free to gobble up the
rest of the string.

    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first
C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
matches just one C<'m'>. Although the second quantifier C<.*?> would
prefer to match no characters, it is constrained by the end-of-string
anchor C<$> to match the rest of the string.

    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
                                  # $1 = 'The progra'
                                  # $2 = 'm'
                                  # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier C<.*?>
to match the empty string, because it is not constrained by a C<^>
anchor to match the beginning of the word. Principle 0 applies here,
however. Because it is possible for the whole regexp to match at the
start of the string, it I<will> match at the start of the string. Thus
the first quantifier has to match everything up to the first C<m>. The
second minimal quantifier matches just one C<m> and the third
quantifier matches the rest of the string.

    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
                                 # $1 = 'a'
                                 # $2 = 'mm'
                                 # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier C<.??> can match
earliest at position C<'a'>, so it does. The second quantifier is
greedy, so it matches C<mm>, and the third matches the rest of the
string.
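
The practical difference between the two kinds of quantifier shows up
most clearly on a string with repeated delimiters. The sketch below is
an added illustration, not one of the examples above; the string and
the names C<$html>, C<$greedy> and C<$minimal> are made up for it.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# In list context a match returns the captured groups.
my ($greedy)  = $html =~ /<(.+)>/;    # .+ grabs as much as it can
my ($minimal) = $html =~ /<(.+?)>/;   # .+? stops at the first '>'

print "greedy:  '$greedy'\n";    # everything up to the last '>'
print "minimal: '$minimal'\n";   # just the first tag name
```

Both regexps start matching at the first C<< < >> (Principle 0); they
differ only in how far the quantifier is willing to reach before the
closing C<< > >>.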

We can modify Principle 3 above to take into account non-greedy
quantifiers:

=over 4

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
regexp to match. The next leftmost greedy (non-greedy) quantifier, if
any, will try to match as much (little) of the string remaining
available to it as possible, while still allowing the whole regexp to
match. And so on, until all the regexp elements are satisfied.

=back

Just like alternation, quantifiers are also susceptible to
backtracking. Here is a step-by-step analysis of the example

    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

=over 4

=item Z<>0

Start with the first letter in the string 't'.

=item Z<>1

The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.

=item Z<>2

'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.

=item Z<>3

'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.

=item Z<>4

Now we can match the 'a' and the 't'.

=item Z<>5

Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.

=item Z<>6

We are done!

=back

Most of the time, all this moving forward and backtracking happens
quickly and searching is fast. There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string.
A typical structure that blows up in your face is of the form

    /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> length k and the second with length n-k, m repetitions
whose bits add up to length n, etc. In fact there are an exponential
number of ways to partition a string as a function of its length. A
regexp may get lucky and match early in the process, but if there is
no match, Perl will try I<every> possibility before giving up. So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.


=head2 Possessive quantifiers

Backtracking during the relentless search for a match may be a waste
of time, particularly when the match is bound to fail. Consider
the simple pattern

    /^\w+\s+\w+$/; # a word, spaces, a word

Whenever this is applied to a string which doesn't quite meet the
pattern's expectations such as S<C<"abc  ">> or S<C<"abc  def ">>,
the regex engine will backtrack, approximately once for each character
in the string. But we know that there is no way around taking I<all>
of the initial word characters to match the first repetition, that I<all>
spaces must be eaten by the middle part, and the same goes for the second
word.

With the introduction of the I<possessive quantifiers> in Perl 5.10, we
have a way of instructing the regex engine not to backtrack, with the
usual quantifiers with a C<+> appended to them. This makes them greedy as
well as stingy; once they succeed they won't give anything back to permit
another solution.
They have the following meanings:

=over 4

=item *

C<a{n,m}+> means: match at least C<n> times, not more than C<m> times,
as many times as possible, and don't give anything up. C<a?+> is short
for C<a{0,1}+>.

=item *

C<a{n,}+> means: match at least C<n> times, but as many times as possible,
and don't give anything up. C<a*+> is short for C<a{0,}+> and C<a++> is
short for C<a{1,}+>.

=item *

C<a{n}+> means: match exactly C<n> times. It is just there for
notational consistency.

=back

These possessive quantifiers represent a special case of a more general
concept, the I<independent subexpression> (see below).

As an example where a possessive quantifier is suitable we consider
matching a quoted string, as it appears in several programming languages.
The backslash is used as an escape character that indicates that the
next character is to be taken literally, as another character for the
string. Therefore, after the opening quote, we expect a (possibly
empty) sequence of alternatives: either some character except an
unescaped quote or backslash or an escaped character.

    /"(?:[^"\\]++|\\.)*+"/;


=head2 Building a regexp

At this point, we have all the basic regexp concepts covered, so let's
give a more involved example of a regular expression. We will build a
regexp that matches numbers.

The first task in building a regexp is to decide what we want to match
and what we want to exclude. In our case, we want to match both
integers and floating point numbers and we want to reject any string
that isn't a number.

The next task is to break the problem down into smaller problems that
are easily converted into a regexp.

The simplest case is integers. These consist of a sequence of digits,
with an optional sign in front.
The digits we can represent with
C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
regexp is

    /[+-]?\d+/;  # matches integers

A floating point number potentially has a sign, an integral part, a
decimal point, a fractional part, and an exponent. One or more of these
parts is optional, so we need to check out the different
possibilities. Floating point numbers which are in proper form include
123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
front is completely optional and can be matched by C<[+-]?>. We can
see that if there is no exponent, floating point numbers must have a
decimal point, otherwise they are integers. We might be tempted to
model these with C<\d*\.\d*>, but this would also match just a single
decimal point, which is not a number. So the three cases of floating
point number without exponent are

    /[+-]?\d+\./;  # 1., 321., etc.
    /[+-]?\.\d+/;  # .1, .234, etc.
    /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.

These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent. Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be 'decoupled' from the
mantissa. The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<e> or C<E>, followed by an integer.
So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher. In complex situations like this, the C<//x> modifier for a
match is invaluable. It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning. Using it,
we can rewrite our 'extended' regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp? The answer is to backslash it
S<C<'\ '>> or put it in a character class S<C<[ ]>>. The same thing
goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
a space between the sign and the mantissa or integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

or written in the compact form,

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp. To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program. This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that specifies
patterns.

=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches. Associated with the matching operator, we have discussed the
single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
extended C<//x> modifiers. There are a few more things you might
want to know about matching operators.

=head3 Prohibiting substitution

If you change C<$pattern> after the first substitution happens, Perl
will ignore it.
If you don't want any substitutions at all, use the
special delimiter C<m''>:

    @pattern = ('Seuss');
    while (<>) {
        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
    }

Similar to strings, C<m''> acts like apostrophes on a regexp; all other
C<m> delimiters act like quotes. If the regexp evaluates to the empty string,
the regexp in the I<last successful match> is used instead. So we have

    "dog" =~ /d/;  # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before


=head3 Global matching

The final two modifiers we will discuss here,
C<//g> and C<//c>, concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the
C<pos()> function.

The use of C<//g> is shown in the following example. Suppose we have
a string that consists of words separated by spaces. If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C<//g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.
If 1587you don't want the position reset after failure to match, add the 1588C<//c>, as in C</regexp/gc>. The current position in the string is 1589associated with the string, not the regexp. This means that different 1590strings have different positions and their respective positions can be 1591set or read independently. 1592 1593In list context, C<//g> returns a list of matched groupings, or if 1594there are no groupings, a list of matches to the whole regexp. So if 1595we wanted just the words, we could use 1596 1597 @words = ($x =~ /(\w+)/g); # matches, 1598 # $words[0] = 'cat' 1599 # $words[1] = 'dog' 1600 # $words[2] = 'house' 1601 1602Closely associated with the C<//g> modifier is the C<\G> anchor. The 1603C<\G> anchor matches at the point where the previous C<//g> match left 1604off. C<\G> allows us to easily do context-sensitive matching: 1605 1606 $metric = 1; # use metric units 1607 ... 1608 $x = <FILE>; # read in measurement 1609 $x =~ /^([+-]?\d+)\s*/g; # get magnitude 1610 $weight = $1; 1611 if ($metric) { # error checking 1612 print "Units error!" unless $x =~ /\Gkg\./g; 1613 } 1614 else { 1615 print "Units error!" unless $x =~ /\Glbs\./g; 1616 } 1617 $x =~ /\G\s+(widget|sprocket)/g; # continue processing 1618 1619The combination of C<//g> and C<\G> allows us to process the string a 1620bit at a time and use arbitrary Perl logic to decide what to do next. 1621Currently, the C<\G> anchor is only fully supported when used to anchor 1622to the start of the pattern. 1623 1624C<\G> is also invaluable in processing fixed-length records with 1625regexps. Suppose we have a snippet of coding region DNA, encoded as 1626base pair letters C<ATCGTTGAAT...> and we want to find all the stop 1627codons C<TGA>. In a coding region, codons are 3-letter sequences, so 1628we can think of the DNA snippet as a sequence of 3-letter records. 
The naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA>> gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, which is
not what we want. The solution is to use C<\G> to anchor the match to
the codon alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

(There are other regexp modifiers that are available, such as
C<//o>, but their specialized uses are beyond the
scope of this introduction.)

=head3 Search and replace

Regular expressions also play a big role in I<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regexp>.
The operator C<=~> is 1681also used here to associate a string with C<s///>. If matching 1682against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match, 1683C<s///> returns the number of substitutions made; otherwise it returns 1684false. Here are a few examples: 1685 1686 $x = "Time to feed the cat!"; 1687 $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" 1688 if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { 1689 $more_insistent = 1; 1690 } 1691 $y = "'quoted words'"; 1692 $y =~ s/^'(.*)'$/$1/; # strip single quotes, 1693 # $y contains "quoted words" 1694 1695In the last example, the whole string was matched, but only the part 1696inside the single quotes was grouped. With the C<s///> operator, the 1697matched variables C<$1>, C<$2>, etc. are immediately available for use 1698in the replacement expression, so we use C<$1> to replace the quoted 1699string with just what was quoted. With the global modifier, C<s///g> 1700will search and replace all occurrences of the regexp in the string: 1701 1702 $x = "I batted 4 for 4"; 1703 $x =~ s/4/four/; # doesn't do it all: 1704 # $x contains "I batted four for 4" 1705 $x = "I batted 4 for 4"; 1706 $x =~ s/4/four/g; # does it all: 1707 # $x contains "I batted four for four" 1708 1709If you prefer 'regex' over 'regexp' in this tutorial, you could use 1710the following program to replace it: 1711 1712 % cat > simple_replace 1713 #!/usr/bin/perl 1714 $regexp = shift; 1715 $replacement = shift; 1716 while (<>) { 1717 s/$regexp/$replacement/g; 1718 print; 1719 } 1720 ^D 1721 1722 % simple_replace regexp regex perlretut.pod 1723 1724In C<simple_replace> we used the C<s///g> modifier to replace all 1725occurrences of the regexp on each line. (Even though the regular 1726expression appears in a loop, Perl is smart enough to compile it 1727only once.) As with C<simple_grep>, both the 1728C<print> and the C<s/$regexp/$replacement/g> use C<$_> implicitly. 
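
The return value of C<s///g> described above can be put to work in a
small helper. The following is only a sketch: the sub name
C<replace_all> and its argument order are hypothetical and not part of
this tutorial's programs. It assumes the replacement is plain text to be
inserted as is.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every match of $regexp in $text and
# report how many substitutions were made.
sub replace_all {
    my ($text, $regexp, $replacement) = @_;

    # s///g returns the number of substitutions, or a false value
    # (the empty string) when nothing matched.
    my $count = ($text =~ s/$regexp/$replacement/g);

    return ($text, $count || 0);
}

my ($new, $n) = replace_all("I batted 4 for 4", '4', 'four');
print "$new ($n substitutions)\n";
```

Because the sub works on a copy of its argument, the caller's original
string is left untouched, and the count lets the caller tell "nothing
matched" apart from "replaced with identical text".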

If you don't want C<s///> to change your original variable you can use
the non-destructive substitute modifier, C<s///r>. This changes the
behavior so that C<s///r> returns the final substituted string
(instead of the number of substitutions):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n";

That example will print "I like dogs. I like cats.". Notice that the
original C<$x> variable has not been affected. The overall
result of the substitution is instead stored in C<$y>. If the
substitution doesn't affect anything then the original string is
returned:

    $x = "I like dogs.";
    $y = $x =~ s/elephants/cougars/r;
    print "$x $y\n"; # prints "I like dogs. I like dogs."

One other interesting thing that the C<s///r> flag allows is chaining
substitutions:

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> treats the
replacement text as Perl code, rather than a double-quoted
string. The value that the code returns is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text.
This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints (characters with equal counts may appear in a different
order, since hash key order is not fixed)

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used, as in C<s'''>, then the regexp and replacement are
treated as single-quoted strings and there are no
variable substitutions. C<s///> in list context
returns the same thing as in scalar context, i.e., the number of
matches.

=head3 The split function

The C<split()> function is another place where a regexp is used.
C<split /regexp/, string, limit> separates the C<string> operand into
a list of substrings and returns that list. The regexp must be designed
to match whatever constitutes the separators for the desired substrings.
The C<limit>, if present, constrains splitting into no more than C<limit>
strings. For example, to split a string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the resulting list contains the matched substrings from the
groupings as well.
For instance, 1808 1809 $x = "/usr/bin/perl"; 1810 @dirs = split m!/!, $x; # $dirs[0] = '' 1811 # $dirs[1] = 'usr' 1812 # $dirs[2] = 'bin' 1813 # $dirs[3] = 'perl' 1814 @parts = split m!(/)!, $x; # $parts[0] = '' 1815 # $parts[1] = '/' 1816 # $parts[2] = 'usr' 1817 # $parts[3] = '/' 1818 # $parts[4] = 'bin' 1819 # $parts[5] = '/' 1820 # $parts[6] = 'perl' 1821 1822Since the first character of $x matched the regexp, C<split> prepended 1823an empty initial element to the list. 1824 1825If you have read this far, congratulations! You now have all the basic 1826tools needed to use regular expressions to solve a wide range of text 1827processing problems. If this is your first time through the tutorial, 1828why not stop here and play around with regexps a while.... S<Part 2> 1829concerns the more esoteric aspects of regular expressions and those 1830concepts certainly aren't needed right at the start. 1831 1832=head1 Part 2: Power tools 1833 1834OK, you know the basics of regexps and you want to know more. If 1835matching regular expressions is analogous to a walk in the woods, then 1836the tools discussed in Part 1 are analogous to topo maps and a 1837compass, basic tools we use all the time. Most of the tools in part 2 1838are analogous to flare guns and satellite phones. They aren't used 1839too often on a hike, but when we are stuck, they can be invaluable. 1840 1841What follows are the more advanced, less used, or sometimes esoteric 1842capabilities of Perl regexps. In Part 2, we will assume you are 1843comfortable with the basics and concentrate on the advanced features. 1844 1845=head2 More on characters, strings, and character classes 1846 1847There are a number of escape sequences and character classes that we 1848haven't covered yet. 1849 1850There are several escape sequences that convert characters or strings 1851between upper and lower case, and they are also available within 1852patterns. 
C<\l> and C<\u> convert the next character to lower or
upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/;  # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'

A C<\L> or C<\U> indicates a lasting conversion of case, until
terminated by C<\E> or overridden by another C<\U> or C<\L>:

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string

If there is no C<\E>, case is converted until the end of the
string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
character of C<$word> to uppercase and the rest of the characters to
lowercase.

Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>. The escape sequence
C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For
instance,

    $x = "\QThat !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language

It does not protect C<$> or C<@>, so that variables can still be
substituted.

C<\Q>, C<\L>, C<\l>, C<\U>, C<\u> and C<\E> are actually part of
double-quotish syntax, and not part of regexp syntax proper. They will
work if they appear in a regular expression embedded directly in a
program, but not when contained in a string that is interpolated in a
pattern.

Perl regexps can handle more than just the
standard ASCII character set. Perl supports I<Unicode>, a standard
for representing the alphabets from virtually all of the world's written
languages, and a host of symbols. Perl's text strings are Unicode strings, so
they can contain characters with a value (codepoint or character number) higher
than 255.

What does this mean for regexps?
Well, regexp users don't need to know
much about Perl's internal representation of strings. But they do need
to know 1) how to represent Unicode characters in a regexp and 2) that
a matching operation will treat the string to be searched as a sequence
of characters, not bytes. The answer to 1) is that Unicode characters
greater than C<chr(255)> are represented using the C<\x{hex}> notation,
because C<\x> followed by bare hex digits (without curly braces) doesn't
go further than 255. (Starting in Perl
5.14, if you're an octal fan, you can also use C<\o{oct}>.)

    /\x{263a}/;  # match a Unicode smiley face :)

B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed. (The only case where it matters is if your Perl script is in
Unicode and encoded in UTF-8; then an explicit C<use utf8> is needed.)

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the I<named character> escape
sequence C<\N{I<name>}>. I<name> is a name for the Unicode character, as
specified in the Unicode standard.
For instance, if we wanted to 1920represent or match the astrological sign for the planet Mercury, we 1921could use 1922 1923 $x = "abc\N{MERCURY}def"; 1924 $x =~ /\N{MERCURY}/; # matches 1925 1926One can also use "short" names: 1927 1928 print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; 1929 print "\N{greek:Sigma} is an upper-case sigma.\n"; 1930 1931You can also restrict names to a certain alphabet by specifying the 1932L<charnames> pragma: 1933 1934 use charnames qw(greek); 1935 print "\N{sigma} is Greek sigma\n"; 1936 1937An index of character names is available on-line from the Unicode 1938Consortium, L<http://www.unicode.org/charts/charindex.html>; explanatory 1939material with links to other resources at 1940L<http://www.unicode.org/standard/where>. 1941 1942The answer to requirement 2) is that a regexp (mostly) 1943uses Unicode characters. The "mostly" is for messy backward 1944compatibility reasons, but starting in Perl 5.14, any regex compiled in 1945the scope of a C<use feature 'unicode_strings'> (which is automatically 1946turned on within the scope of a C<use 5.012> or higher) will turn that 1947"mostly" into "always". If you want to handle Unicode properly, you 1948should ensure that C<'unicode_strings'> is turned on. 1949Internally, this is encoded to bytes using either UTF-8 or a native 8 1950bit encoding, depending on the history of the string, but conceptually 1951it is a sequence of characters, not bytes. See L<perlunitut> for a 1952tutorial about that. 1953 1954Let us now discuss Unicode character classes, most usually called 1955"character properties". These are represented by the 1956C<\p{name}> escape sequence. Closely associated is the C<\P{name}> 1957property, which is the negation of the C<\p{name}> one. 
For 1958example, to match lower and uppercase characters, 1959 1960 $x = "BOB"; 1961 $x =~ /^\p{IsUpper}/; # matches, uppercase char class 1962 $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase 1963 $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class 1964 $x =~ /^\P{IsLower}/; # matches, char class sans lowercase 1965 1966(The "Is" is optional.) 1967 1968There are many, many Unicode character properties. For the full list 1969see L<perluniprops>. Most of them have synonyms with shorter names, 1970also listed there. Some synonyms are a single character. For these, 1971you can drop the braces. For instance, C<\pM> is the same thing as 1972C<\p{Mark}>, meaning things like accent marks. 1973 1974The Unicode C<\p{Script}> property is used to categorize every Unicode 1975character into the language script it is written in. For example, 1976English, French, and a bunch of other European languages are written in 1977the Latin script. But there is also the Greek script, the Thai script, 1978the Katakana script, etc. You can test whether a character is in a 1979particular script with, for example C<\p{Latin}>, C<\p{Greek}>, 1980or C<\p{Katakana}>. To test if it isn't in the Balinese script, you 1981would use C<\P{Balinese}>. 1982 1983What we have described so far is the single form of the C<\p{...}> character 1984classes. There is also a compound form which you may run into. These 1985look like C<\p{name=value}> or C<\p{name:value}> (the equals sign and colon 1986can be used interchangeably). These are more general than the single form, 1987and in fact most of the single forms are just Perl-defined shortcuts for common 1988compound forms. For example, the script examples in the previous paragraph 1989could be written equivalently as C<\p{Script=Latin}>, C<\p{Script:Greek}>, 1990C<\p{script=katakana}>, and C<\P{script=balinese}> (case is irrelevant 1991between the C<{}> braces). 
You may 1992never have to use the compound forms, but sometimes it is necessary, and their 1993use can make your code easier to understand. 1994 1995C<\X> is an abbreviation for a character class that comprises 1996a Unicode I<extended grapheme cluster>. This represents a "logical character": 1997what appears to be a single character, but may be represented internally by more 1998than one. As an example, using the Unicode full names, e.g., S<C<A + COMBINING 1999RING>> is a grapheme cluster with base character C<A> and combining character 2000S<C<COMBINING RING>>, which translates in Danish to A with the circle atop it, 2001as in the word E<Aring>ngstrom. 2002 2003For the full and latest information about Unicode see the latest 2004Unicode standard, or the Unicode Consortium's website L<http://www.unicode.org> 2005 2006As if all those classes weren't enough, Perl also defines POSIX-style 2007character classes. These have the form C<[:name:]>, with C<name> the 2008name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>, 2009C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>, 2010C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl 2011extension to match C<\w>), and C<blank> (a GNU extension). The C<//a> 2012modifier restricts these to matching just in the ASCII range; otherwise 2013they can match the same as their corresponding Perl Unicode classes: 2014C<[:upper:]> is the same as C<\p{IsUpper}>, etc. (There are some 2015exceptions and gotchas with this; see L<perlrecharclass> for a full 2016discussion.) The C<[:digit:]>, C<[:word:]>, and 2017C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s> 2018character classes. To negate a POSIX class, put a C<^> in front of 2019the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and, under 2020Unicode, C<\P{IsDigit}>. 
The Unicode and POSIX character classes can 2021be used just like C<\d>, with the exception that POSIX character 2022classes can only be used inside of a character class: 2023 2024 /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit 2025 /^=item\s[[:digit:]]/; # match '=item', 2026 # followed by a space and a digit 2027 /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit 2028 /^=item\s\p{IsDigit}/; # match '=item', 2029 # followed by a space and a digit 2030 2031Whew! That is all the rest of the characters and character classes. 2032 2033=head2 Compiling and saving regular expressions 2034 2035In Part 1 we mentioned that Perl compiles a regexp into a compact 2036sequence of opcodes. Thus, a compiled regexp is a data structure 2037that can be stored once and used again and again. The regexp quote 2038C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a 2039regexp and transforms the result into a form that can be assigned to a 2040variable: 2041 2042 $reg = qr/foo+bar?/; # reg contains a compiled regexp 2043 2044Then C<$reg> can be used as a regexp: 2045 2046 $x = "fooooba"; 2047 $x =~ $reg; # matches, just like /foo+bar?/ 2048 $x =~ /$reg/; # same thing, alternate form 2049 2050C<$reg> can also be interpolated into a larger regexp: 2051 2052 $x =~ /(abc)?$reg/; # still matches 2053 2054As with the matching operator, the regexp quote can use different 2055delimiters, e.g., C<qr!!>, C<qr{}> or C<qr~~>. Apostrophes 2056as delimiters (C<qr''>) inhibit any interpolation. 2057 2058Pre-compiled regexps are useful for creating dynamic matches that 2059don't need to be recompiled each time they are encountered. Using 2060pre-compiled regexps, we write a C<grep_step> program which greps 2061for a sequence of patterns, advancing to the next pattern as soon 2062as one has been satisfied. 

    % cat > grep_step
    #!/usr/bin/perl
    # grep_step - match <number> regexps, one after the other
    # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    @compiled = map qr/$_/, @regexp;
    while ($line = <>) {
        if ($line =~ /$compiled[0]/) {
            print $line;
            shift @compiled;
            last unless @compiled;
        }
    }
    ^D

    % grep_step 3 shift print last grep_step
    $number = shift;
            print $line;
            last unless @compiled;

Storing pre-compiled regexps in an array C<@compiled> allows us to
simply loop through the regexps without any recompilation, thus gaining
flexibility without sacrificing speed.


=head2 Composing regular expressions at runtime

Backtracking is more efficient than repeated tries with different regular
expressions. If there are several regular expressions and a match with
any of them is acceptable, then it is possible to combine them into a set
of alternatives. If the individual expressions are input data, this
can be done by programming a join operation. We'll exploit this idea in
an improved version of the C<simple_grep> program: a program that matches
multiple patterns:

    % cat > multi_grep
    #!/usr/bin/perl
    # multi_grep - match any of <number> regexps
    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    $pattern = join '|', @regexp;

    while ($line = <>) {
        print $line if $line =~ /$pattern/;
    }
    ^D

    % multi_grep 2 shift for multi_grep
    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);

Sometimes it is advantageous to construct a pattern from the I<input>
that is to be analyzed and use the permissible values on the left
hand side of the matching operations.
As an example of this somewhat
paradoxical situation, let's assume that our input contains a command
verb which should match one out of a set of available command verbs,
with the additional twist that commands may be abbreviated as long as
the given string is unique. The program below demonstrates the basic
algorithm.

    % cat > keymatch
    #!/usr/bin/perl
    $kwds = 'copy compare list print';
    while( $cmd = <> ){
        $cmd =~ s/^\s+|\s+$//g;  # trim leading and trailing spaces
        if( ( @matches = $kwds =~ /\b$cmd\w*/g ) == 1 ){
            print "command: '@matches'\n";
        } elsif( @matches == 0 ){
            print "no such command: '$cmd'\n";
        } else {
            print "not unique: '$cmd' (could be one of: @matches)\n";
        }
    }
    ^D

    % keymatch
    li
    command: 'list'
    co
    not unique: 'co' (could be one of: copy compare)
    printer
    no such command: 'printer'

Rather than trying to match the input against the keywords, we match the
combined set of keywords against the input. The pattern matching
operation S<C<$kwds =~ /\b$cmd\w*/g>> does several things at the
same time. It makes sure that the given command begins where a keyword
begins (C<\b>). It tolerates abbreviations due to the added C<\w*>. It
tells us the number of matches (C<scalar @matches>) and all the keywords
that were actually matched. You could hardly ask for more.

=head2 Embedding comments and modifiers in a regular expression

Starting with this section, we will be discussing Perl's set of
I<extended patterns>. These are extensions to the traditional regular
expression syntax that provide powerful new tools for pattern
matching. We have already seen extensions in the form of the minimal
matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>.
Most 2166of the extensions below have the form C<(?char...)>, where the 2167C<char> is a character that determines the type of extension. 2168 2169The first extension is an embedded comment C<(?#text)>. This embeds a 2170comment into the regular expression without affecting its meaning. The 2171comment should not have any closing parentheses in the text. An 2172example is 2173 2174 /(?# Match an integer:)[+-]?\d+/; 2175 2176This style of commenting has been largely superseded by the raw, 2177freeform commenting that is allowed with the C<//x> modifier. 2178 2179Most modifiers, such as C<//i>, C<//m>, C<//s> and C<//x> (or any 2180combination thereof) can also be embedded in 2181a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance, 2182 2183 /(?i)yes/; # match 'yes' case insensitively 2184 /yes/i; # same thing 2185 /(?x)( # freeform version of an integer regexp 2186 [+-]? # match an optional sign 2187 \d+ # match a sequence of digits 2188 ) 2189 /x; 2190 2191Embedded modifiers can have two important advantages over the usual 2192modifiers. Embedded modifiers allow a custom set of modifiers to 2193I<each> regexp pattern. This is great for matching an array of regexps 2194that must have different modifiers: 2195 2196 $pattern[0] = '(?i)doctor'; 2197 $pattern[1] = 'Johnson'; 2198 ... 2199 while (<>) { 2200 foreach $patt (@pattern) { 2201 print if /$patt/; 2202 } 2203 } 2204 2205The second advantage is that embedded modifiers (except C<//p>, which 2206modifies the entire regexp) only affect the regexp 2207inside the group the embedded modifier is contained in. So grouping 2208can be used to localize the modifier's effects: 2209 2210 /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc. 2211 2212Embedded modifiers can also turn off any modifiers already present 2213by using, e.g., C<(?-i)>. Modifiers can also be combined into 2214a single expression, e.g., C<(?s-i)> turns on single line mode and 2215turns off case insensitivity. 
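
The scoping rules for embedded modifiers can be checked with a short
script. This is a minimal sketch; the variable names C<$localized> and
C<$mixed> and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# 'Answer: ' must match exactly; only the group is case-insensitive.
my $localized = qr/Answer: ((?i)yes)/;

# (?i) applies from where it appears until (?-i) turns it off again.
my $mixed = qr/(?i)hello (?-i)World/;

for my $s ("Answer: YES", "answer: yes", "HELLO World", "hello world") {
    printf "%-12s localized=%d mixed=%d\n", $s,
        ($s =~ $localized ? 1 : 0),
        ($s =~ $mixed     ? 1 : 0);
}
```

Only C<"Answer: YES"> matches C<$localized>, because the C<(?i)> is
confined to its group, and only C<"HELLO World"> matches C<$mixed>,
because C<World> follows the C<(?-i)>.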

Embedded modifiers may also be added to a non-capturing grouping.
C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
case insensitively and turns off multi-line mode.


=head2 Looking ahead and looking behind

This section concerns the lookahead and lookbehind assertions. First,
a little background.

In Perl regular expressions, most regexp elements 'eat up' a certain
amount of string when they match. For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
sense that Perl moves to the next character position in the string
after the match. There are some elements, however, that don't eat up
characters (advance the character position) if they match. The examples
we have seen so far are the anchors. The anchor C<^> matches the
beginning of the line, but doesn't eat any characters. Similarly, the
word boundary anchor C<\b> matches wherever a character matching C<\w>
is next to a character that doesn't, but it doesn't eat up any
characters itself. Anchors are examples of I<zero-width assertions>:
zero-width, because they consume
no characters, and assertions, because they test some property of the
string. In the context of our walk in the woods analogy to regexp
matching, most regexp elements move us along a trail, but anchors have
us stop a moment and check our surroundings. If the local environment
checks out, we can proceed forward. But if the local environment
doesn't satisfy us, we must backtrack.

Checking the environment entails either looking ahead on the trail,
looking behind, or both. C<^> looks behind, to see that there are no
characters before. C<$> looks ahead, to see that there are no
characters after. C<\b> looks both ahead and behind, to see if the
characters on either side differ in their "word-ness".
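
To see that an anchor really is zero-width, one can ask Perl where it
matches. The sketch below (the variable names are illustrative only)
collects every position in a string at which C<\b> succeeds; because
the assertion consumes nothing, C<pos()> after each match is simply the
position of the boundary.

```perl
use strict;
use warnings;

my $x = "cat dog";

# \b consumes no characters, so each successful //g match is empty
# and pos() reports where the boundary lies.
my @boundaries;
push @boundaries, pos($x) while $x =~ /\b/g;

print "word boundaries at: @boundaries\n";
```

The positions found are the start and end of each word: 0 and 3 for
C<cat>, 4 and 7 for C<dog>.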
2251 2252The lookahead and lookbehind assertions are generalizations of the 2253anchor concept. Lookahead and lookbehind are zero-width assertions 2254that let us specify which characters we want to test for. The 2255lookahead assertion is denoted by C<(?=regexp)> and the lookbehind 2256assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are 2257 2258 $x = "I catch the housecat 'Tom-cat' with catnip"; 2259 $x =~ /cat(?=\s)/; # matches 'cat' in 'housecat' 2260 @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches, 2261 # $catwords[0] = 'catch' 2262 # $catwords[1] = 'catnip' 2263 $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat' 2264 $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in 2265 # middle of $x 2266 2267Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are 2268non-capturing, since these are zero-width assertions. Thus in the 2269second regexp, the substrings captured are those of the whole regexp 2270itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but 2271lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed 2272width, i.e., a fixed number of characters long. Thus 2273C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The 2274negated versions of the lookahead and lookbehind assertions are 2275denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively. 2276They evaluate true if the regexps do I<not> match: 2277 2278 $x = "foobar"; 2279 $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo' 2280 $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo' 2281 $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo' 2282 2283The C<\C> is unsupported in lookbehind, because the already 2284treacherous definition of C<\C> would become even more so 2285when going backwards. 2286 2287Here is an example where a string containing blank-separated words, 2288numbers and single dashes is to be split into its components. 
Using C</\s+/> alone won't work, because spaces are not required between
dashes, or between a word and a dash. Additional places for a split are
established by looking ahead and behind:

    $str = "one two - --6-8";
    @toks = split / \s+              # a run of spaces
                  | (?<=\S) (?=-)    # any non-space followed by '-'
                  | (?<=-)  (?=\S)   # a '-' followed by any non-space
                  /x, $str;          # @toks = qw(one two - - - 6 - 8)


=head2 Using independent subexpressions to prevent backtracking

I<Independent subexpressions> are regular expressions, in the
context of a larger regular expression, that function independently of
the larger regular expression. That is, they consume as much or as
little of the string as they wish without regard for the ability of
the larger regexp to match. Independent subexpressions are represented
by C<< (?>regexp) >>. We can illustrate their behavior by first
considering an ordinary regexp:

    $x = "ab";
    $x =~ /a*ab/;  # matches

This obviously matches, but in the process of matching, the
subexpression C<a*> first grabbed the C<a>. Doing so, however,
wouldn't allow the whole regexp to match, so after backtracking, C<a*>
eventually gave back the C<a> and matched the empty string. Here, what
C<a*> matched was I<dependent> on what the rest of the regexp matched.

Contrast that with an independent subexpression:

    $x =~ /(?>a*)ab/;  # doesn't match!

The independent subexpression C<< (?>a*) >> doesn't care about the rest
of the regexp, so it sees an C<a> and grabs it. Then the rest of the
regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
is no backtracking and the independent subexpression does not give
up its C<a>. Thus the match of the regexp as a whole fails.
A similar behavior occurs with completely independent regexps:

    $x = "ab";
    $x =~ /a*/g;   # matches, eats an 'a'
    $x =~ /\Gab/g; # doesn't match, no 'a' available

Here C<//g> and C<\G> create a 'tag team' handoff of the string from
one regexp to the other. Regexps with an independent subexpression are
much like this, with a handoff of the string to the independent
subexpression, and a handoff of the string back to the enclosing
regexp.

The ability of an independent subexpression to prevent backtracking
can be quite useful. Suppose we want to match a non-empty string
enclosed in parentheses up to two levels deep. Then the following
regexp matches:

    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;

The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis. The alternation is two-way, with
the first alternative C<[^()]+> matching a substring with no
parentheses and the second alternative C<\([^()]*\)> matching a
substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point. This can be done by
enclosing the inner quantifier as an independent subexpression:

    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it. Then
match failures fail much more quickly.
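
As a minimal sketch of the same idea, the three matches below can be
run side by side. The possessive quantifier C<a*+> (Perl 5.10 and
later) is shorthand for C<< (?>a*) >>; the variable names are
illustrative only and are not used elsewhere in this tutorial:

    use strict;
    use warnings;

    # Ordinary quantifier: a* can give back its 'a', so the match succeeds.
    my $plain = "ab" =~ /a*ab/ ? 1 : 0;

    # Independent subexpression: (?>a*) keeps its 'a', so 'ab' cannot match.
    my $atomic = "ab" =~ /(?>a*)ab/ ? 1 : 0;

    # Possessive quantifier: a*+ behaves like (?>a*).
    my $possessive = "ab" =~ /a*+ab/ ? 1 : 0;

    print "plain=$plain atomic=$atomic possessive=$possessive\n";

This prints

    plain=1 atomic=0 possessive=0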


=head2 Conditional expressions

A I<conditional expression> is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition. There are two types of conditional expression:
C<(?(condition)yes-regexp)> and
C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
like an S<C<'if () {}'>> statement in Perl. If the C<condition> is true,
the C<yes-regexp> will be matched. If the C<condition> is false, the
C<yes-regexp> will be skipped and Perl will move on to the next regexp
element. The second form is like an S<C<'if () {} else {}'>> statement
in Perl. If the C<condition> is true, the C<yes-regexp> will be
matched, otherwise the C<no-regexp> will be matched.

The C<condition> can have several forms. The first form is simply an
integer in parentheses C<(integer)>. It is true if the corresponding
backreference C<\integer> matched earlier in the regexp. The same
thing can be done with a name associated with a capture group, written
as C<< (<name>) >> or C<< ('name') >>. The second form is a bare
zero-width assertion C<(?...)>, either a lookahead, a lookbehind, or a
code assertion (discussed in a later section). The third set of forms
provides tests that return true if the expression is executed within
a recursion (C<(R)>) or is being called from some capturing group,
referenced either by number (C<(R1)>, C<(R2)>,...) or by name
(C<(R&name)>).

The integer or name form of the C<condition> allows us to choose,
with more flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">:

    % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu

The lookbehind C<condition> allows, along with backreferences,
an earlier part of the match to influence a later part of the
match. For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in C<AAG>, or in some
other base pair combination followed by C<C>. Note that the form is
C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
lookahead, lookbehind or code assertions, the parentheses around the
conditional are not needed.


=head2 Defining named patterns

Some regular expressions use identical subpatterns in several places.
Starting with Perl 5.10, it is possible to define named subpatterns in
a section of the pattern so that they can be called up by name
anywhere in the pattern. The syntax for this definition
group is C<< (?(DEFINE)(?<name>pattern)...) >>. An insertion
of a named pattern is written as C<(?&name)>.

The example below illustrates this feature using the pattern for
floating point numbers that was presented earlier on. The three
subpatterns that are used more than once are the optional sign, the
digit sequence for an integer and the decimal fraction. The DEFINE
group at the end of the pattern contains their definition. Notice
that the decimal fraction pattern is the first place where we can
reuse the integer pattern.

    /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
       (?: [eE](?&osg)(?&int) )?
     $
     (?(DEFINE)
       (?<osg>[-+]?)         # optional sign
       (?<int>\d++)          # integer
       (?<dec>\.(?&int))     # decimal fraction
     )/x


=head2 Recursive patterns

This feature (introduced in Perl 5.10) significantly extends the
power of Perl's pattern matching.
By referring to some other
capture group anywhere in the pattern with the construct
C<(?group-ref)>, the I<pattern> within the referenced group is used
as an independent subpattern in place of the group reference itself.
Because the group reference may be contained I<within> the group it
refers to, it is now possible to apply pattern matching to tasks that
hitherto required a recursive parser.

To illustrate this feature, we'll design a pattern that matches if
a string contains a palindrome. (This is a word or a sentence that,
while ignoring spaces, punctuation and case, reads the same backwards
as forwards.) We begin by observing that the empty string or a string
containing just one word character is a palindrome. Otherwise it must
have a word character up front and the same at its end, with another
palindrome in between.

    /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x

Adding C<\W*> at either end to eliminate what is to be ignored, we already
have the full pattern:

    my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
    for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
        print "'$s' is a palindrome\n" if $s =~ /$pp/;
    }

In C<(?...)> both absolute and relative backreferences may be used.
The entire pattern can be reinserted with C<(?R)> or C<(?0)>.
If you prefer to name your groups, you can use C<(?&name)> to
recurse into that group.


=head2 A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions.
I<Code evaluation> expressions turn that around by allowing
arbitrary Perl code to be a part of a regexp. A code evaluation
expression is denoted C<(?{code})>, with I<code> a string of Perl
statements.

Be warned that this feature is considered experimental, and may be
changed without notice.
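
As a minimal sketch of the notation, a code expression can record the
position the match has reached at the moment the engine passes over
it. The variable name C<$where> is illustrative only:

    our $where;

    # Inside the code block, pos() reports the current match position
    # in the string being matched.
    "abcdef" =~ /abc(?{ $where = pos(); })def/;

    print "code ran at position $where\n";   # position 3, just after 'abc'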

Code expressions are zero-width assertions, and the value they return
depends on their environment. There are two possibilities: either the
code expression is used as a conditional in a conditional expression
C<(?(condition)...)>, or it is not. If the code expression is a
conditional, the code is evaluated and the result (i.e., the result of
the last statement) is used to determine truth or falsehood. If the
code expression is not used as a conditional, the assertion always
evaluates true and the result is put into the special variable
C<$^R>. The variable C<$^R> can then be used in code expressions later
in the regexp. Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'

Pay careful attention to the next example:

    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?

At first glance, you'd think that it shouldn't print, because obviously
the C<ddd> isn't going to match the target string. But look at this
example:

    $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match,
                                            # but _does_ print

Hmm. What happened here? If you've been following along, you know that
the above pattern should be effectively (almost) the same as the last one;
enclosing the C<d> in a character class isn't going to change what it
matches. So why does the first not print while the second one does?

The answer lies in the optimizations the regex engine makes. In the first
case, all the engine sees are plain old characters (aside from the
C<?{}> construct). It's smart enough to realize that the string 'ddd'
doesn't occur in our target string before actually running the pattern
through.
But in the second case, we've tricked it into thinking that our
pattern is more complicated. It takes a look, sees our
character class, and decides that it will have to actually run the
pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't
have a match.

To take a closer look at how the engine does optimizations, see the
section L<"Pragmas and debugging"> below.

More fun with C<?{}>:

    $x =~ /(?{print "Hi Mom!";})/;         # matches,
                                           # prints 'Hi Mom!'
    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
                                           # prints '1'
    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
                                           # prints '1'

The bit of magic mentioned in the section title occurs when the regexp
backtracks in the process of searching for a match. If the regexp
backtracks over a code expression and if the variables used within are
localized using C<local>, the changes in the variables produced by the
code expression are undone! Thus, if we wanted to count how many times
a character got matched inside a group, we could use, e.g.,

    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";

This prints

    'a' count is 2, $c variable is 'bob'

If we replace the S<C< (?{local $c = $c + 1;})>> with
S<C< (?{$c = $c + 1;})>>, the variable changes are I<not> undone
during backtracking, and we get

    'a' count is 4, $c variable is 'bob'

Note that only localized variable changes are undone.
Other side
effects of code expression execution are permanent. Thus

    $x = "aaaa";
    $x =~ /(a(?{print "Yow\n";}))*aa/;

produces

    Yow
    Yow
    Yow
    Yow

The result C<$^R> is automatically localized, so that it will behave
properly in the presence of backtracking.

This example uses a code expression in a conditional to match a
definite article, either 'the' in English or 'der|die|das' in German:

    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (der|die|das)     # else, match 'der|die|das'
                     )
                    /xi;

Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
code expression, we don't need the extra parentheses around the
conditional.

If you try to use code expressions where the code text is contained within
an interpolated variable, rather than appearing literally in the pattern,
Perl may surprise you:

    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
    /foo(?{ 1 })$bar/;   # compiles ok, $bar interpolated
    /foo${pat}bar/;      # compile error!

    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
    /foo${pat}bar/;      # compiles ok

If a regexp has a variable that interpolates a code expression, Perl
treats the regexp as an error. If the code expression is precompiled into
a variable, however, interpolating is ok. The question is, why is this an
error?

The reason is that variable interpolation and code expressions
together pose a security risk.
The combination is dangerous because
programmers who write search engines often take user input and
plug it directly into a regexp:

    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp

If the C<$regexp> variable contains a code expression, the user could
then execute arbitrary Perl code. For instance, some joker could
search for S<C<system('rm -rf *');>> to erase your files. In this
sense, the combination of interpolation and code expressions I<taints>
your regexp. So by default, using both interpolation and code
expressions in the same regexp is not allowed. If you're not
concerned about malicious users, it is possible to bypass this
security check by invoking S<C<use re 'eval'>>:

    use re 'eval';       # throw caution out the door
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo${pat}bar/;      # compiles ok

Another form of code expression is the I<pattern code expression>.
The pattern code expression is like a regular code expression, except
that the result of the code evaluation is treated as a regular
expression and matched immediately. A simple example is

    $length = 5;
    $char = 'a';
    $x = 'aaaaabb';
    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'


This final example contains both ordinary and pattern code
expressions. It detects whether a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,...
of the C<1>'s:

    $x = "1101010010001000001";
    $z0 = ''; $z1 = '0';   # initial conditions
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
                    (?:
                       ((??{ $z0 })) # match some '0'
                       1             # and then a '1'
                       (?{ $z0 = $z1; $z1 .= $^N; })
                    )+   # repeat as needed
                  $      # that is all there is
                 /x;
    printf "Largest sequence matched was %d\n", length($z1)-length($z0);

Remember that C<$^N> is set to whatever was matched by the last
completed capture group. This prints

    It is a Fibonacci sequence
    Largest sequence matched was 5

Ha! Try that with your garden variety regexp package...

Note that the variables C<$z0> and C<$z1> are not substituted when the
regexp is compiled, as happens for ordinary variables outside a code
expression. Rather, the whole code block is parsed as perl code at the
same time as perl is compiling the code containing the literal regexp
pattern.

The regexp without the C<//x> modifier is

    /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/

which shows that spaces are still possible in the code parts. Nevertheless,
when working with code and conditional expressions, the extended form of
regexps is almost necessary in creating and debugging regexps.


=head2 Backtracking control verbs

Perl 5.10 introduced a number of control verbs intended to provide
detailed control over the backtracking process, by directly influencing
the regexp engine and by providing monitoring techniques. As all
the features in this group are experimental and subject to change or
removal in a future version of Perl, the interested reader is
referred to L<perlre/"Special Backtracking Control Verbs"> for a
detailed description.

Below is just one example, illustrating the control verb C<(*FAIL)>,
which may be abbreviated as C<(*F)>.
If this is inserted in a regexp
it will cause it to fail, just as it would at some
mismatch between the pattern and the string. Processing
of the regexp continues as it would after any "normal"
failure, so that, for instance, the next position in the string or another
alternative will be tried. As failing to match doesn't preserve capture
groups or produce results, it may be necessary to use this in
combination with embedded code.

    %count = ();
    "supercalifragilisticexpialidocious" =~
        /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;
    printf "%3d '%s'\n", $count{$_}, $_ for (sort keys %count);

The pattern begins with a class matching a subset of letters. Whenever
this matches, a statement like C<$count{'a'}++;> is executed, incrementing
the letter's counter. Then C<(*FAIL)> does what it says, and
the regexp engine proceeds according to the book: as long as the end of
the string hasn't been reached, the position is advanced before looking
for another vowel. Thus, match or no match makes no difference, and the
regexp engine proceeds until the entire string has been inspected.
(It's remarkable that an alternative solution using something like

    $count2{lc($_)}++ for split('', "supercalifragilisticexpialidocious");
    printf "%3d '%s'\n", $count2{$_}, $_ for ( qw{ a e i o u } );

is considerably slower.)


=head2 Pragmas and debugging

Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl. We have already encountered one pragma in
the previous section, S<C<use re 'eval';>>, that allows variable
interpolation and code expressions to coexist in a regexp. The other
pragmas are

    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted

The C<taint> pragma causes any substrings from a match with a tainted
variable to be tainted as well.
This is not normally the case, as
regexps are often used to extract the safe bits from a tainted
variable. Use C<taint> when you are not extracting safe bits, but are
performing some other processing. Both C<taint> and C<eval> pragmas
are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.

    use re '/m';  # or any other flags
    $multiline_string =~ /^foo/; # /m is implied

The C<re '/flags'> pragma (introduced in Perl
5.14) turns on the given regular expression flags
until the end of the lexical scope. See
L<re/"'E<sol>flags' mode"> for more
detail.

    use re 'debug';
    /^(.*)$/s;       # output debugging info

    use re 'debugcolor';
    /^(.*)$/s;       # output debugging info in living color

The global C<debug> and C<debugcolor> pragmas allow one to get
detailed debugging info about regexp compilation and
execution. C<debugcolor> is the same as debug, except the debugging
information is displayed in color on terminals that can display
termcap color sequences. Here is example output:

    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
    Compiling REx 'a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)
    floating 'bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx 'a*b+c' against 'abc'...
    Found floating substr 'bc' at offset 1...
    Guessed: match at offset 0
    Matching REx 'a*b+c' against 'abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: 'a*b+c'

If you have gotten this far into the tutorial, you can probably guess
what the different parts of the debugging output tell you. The first
part

    Compiling REx 'a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)

describes the compilation stage. C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, goto line 4,
i.e., C<PLUS(7)>. The middle lines describe some heuristics and
optimizations performed before a match:

    floating 'bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx 'a*b+c' against 'abc'...
    Found floating substr 'bc' at offset 1...
    Guessed: match at offset 0

Then the match is executed and the remaining lines describe the
process:

    Matching REx 'a*b+c' against 'abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: 'a*b+c'

Each step is of the form S<C<< n <x> <y> >>>, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched. The S<C<< | 1: STAR >>> says that Perl is at line number 1
in the compilation list above. See
L<perldebguts/"Debugging Regular Expressions"> for much more detail.

An alternative method of debugging regexps is to embed C<print>
statements within the regexp.
This provides a blow-by-blow account of
the backtracking in an alternation:

    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                     t(?{print "t1\n";})
                     h(?{print "h1\n";})
                     i(?{print "i1\n";})
                     s(?{print "s1\n";})
                         |
                     t(?{print "t2\n";})
                     h(?{print "h2\n";})
                     a(?{print "a2\n";})
                     t(?{print "t2\n";})
                     (?{print "Done at position ", pos, "\n";})
                    @x;

prints

    Start at position 0
    t1
    h1
    t2
    h2
    a2
    t2
    Done at position 4

=head1 BUGS

Code expressions, conditional expressions, and independent expressions
are I<experimental>. Don't use them in production code. Yet.

=head1 SEE ALSO

This is just a tutorial. For the full story on Perl regular
expressions, see the L<perlre> regular expressions reference page.

For more information on the matching C<m//> and substitution C<s///>
operators, see L<perlop/"Regexp Quote-Like Operators">. For
information on the C<split> operation, see L<perlfunc/split>.

For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.

The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.

=cut
