=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples.
The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme.
The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., C<""> is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in a regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string.
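
The four cases just discussed can be checked with a short script. This
is only a minimal sketch: the variable names C<$string> and C<%matched>
are made up for illustration, and the patterns are interpolated from a
list in the same way as the C<$greeting> example.

```perl
use strict;
use warnings;

# Illustrative only: the pattern list mirrors the four examples in the text.
my $string = "Hello World";
my %matched;

for my $pattern ('world', 'o W', 'oW', 'World ') {
    # The variable is interpolated into the regexp before matching.
    $matched{$pattern} = ($string =~ /$pattern/) ? 1 : 0;
    printf "%-10s %s\n", "'$pattern'",
        $matched{$pattern} ? "matches" : "doesn't match";
}
```

Only C<'o W'> is reported as matching; the other three differ from the
string in case or in the placement of the space.
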
The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called B<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;   # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell.
If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)   # matches
    "1000\n2000" =~ /0\n20/   # matches
    "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
    "cat"        =~ /\143\x61\x74/ # matches, but a weird way to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >> > loops over all the lines in
all the files.
For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    ""       =~ /^$/;          # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. Character
classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                         # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>.
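
As a quick check of that equivalence, the following sketch runs both
forms against a few sample inputs and compares the results. The input
list and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample inputs, chosen only for illustration.
my @answers = ('yes', 'Yes', 'YES', 'yEs', 'no');
my @agree;

for my $answer (@answers) {
    my $with_classes  = ($answer =~ /[yY][eE][sS]/) ? 1 : 0;
    my $with_modifier = ($answer =~ /yes/i)         ? 1 : 0;

    # Record whether the two spellings of the regexp gave the same result.
    push @agree, ($with_classes == $with_modifier) ? 1 : 0;
    print "$answer: classes=$with_classes, modifier=$with_modifier\n";
}
```

For each of these inputs the bracketed form and the modified form
report the same result.
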
The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are C<-]\^$>. C<]>
is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.
Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.

An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;    # matches cat in 'housecat'
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    ""   =~ /./;      # doesn't match; it needs a char
    ""   =~ /^.$/;    # doesn't match; it needs a char
    "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;    # matches
    "a\n"  =~ /^.$/;  # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line.
Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful. If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, perl will try to match the
regexp at the earliest possible point in the string.
At each
character position, perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, perl will then try the
next alternative, C<cat>. If C<cat> doesn't match either, then the
match fails and perl moves to the next position in the string. Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice. It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The B<grouping> metacharacters C<()> solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses.
Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

The process of trying one alternative, seeing if it matches, and
moving on to the next alternative if it doesn't, is called
B<backtracking>. The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some which are dead ends. When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail.
If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what perl does when it tries to match the regexp

    "abcde" =~ /(abd|abc)(df|d|de)/;

=over 4

=item 0

Start with the first letter in the string 'a'.

=item 1

Try the first alternative in the first group 'abd'.

=item 2

Match 'a' followed by 'b'. So far so good.

=item 3

'd' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.

=item 4

Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.

=item 5

Move on to the second group and pick the first alternative
'df'.

=item 6

Match the 'd'.

=item 7

'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.

=item 8

'd' matches. The second grouping is satisfied, so set $2 to
'd'.

=item 9

We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".

=back

There are a couple of things to note about this analysis. First, the
third alternative in the second group 'de' also allows a match, but we
stopped before we got to it - at a given character position, leftmost
wins. Second, we were able to get a match at the first character
position of the string 'a'. If there were no matches at the first
position, perl would move to the second character position 'b' and
attempt the match all over again.
Only when all possible paths at all
possible character positions have been exhausted does perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.

Even with all this work, regexp matching happens remarkably fast. To
speed things up, during the compilation stage, perl compiles the regexp
into a compact sequence of opcodes that can often fit inside a
processor cache. When the code is executed, these opcodes can then run
at full throttle and search very quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched. This is very useful to find out what matched and for
text processing in general. For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, etc. They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
value. In list context, however, it returns the list of matched values
C<($1,$2,$3)>. So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
convenience, perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>, ...
that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>,
C<$2>, ... associated with the rightmost closing parenthesis used in the
match).

Closely associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
matching variables that can be used I<inside> a regexp. This is a
really nice feature - what matches later in a regexp can depend on
what matched earlier in the regexp. Suppose we wanted to look
for doubled words in text, like 'the the'. The following regexp finds
all 3-letter doubles with a space in between:

    /(\w\w\w)\s\1/;

The grouping assigns a value to \1, so that the same 3 letter sequence
is used for both parts. Here are some words with repeated parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc. and uses C<\1> to look for
a repeat. Although C<$1> and C<\1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
so may lead to surprising and/or undefined results.

In addition to what was matched, Perl 5.6.0 also provides the
positions of what was matched with the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>.
Then this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)

Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string. If you use them, perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match. An example:

    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'

In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'. It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, and C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
If you need them, use C<@-> and C<@+> instead:

    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )

=head2 Matching repetitions

The examples in the previous section display an annoying weakness. We
were only matching 3-letter words, or syllables of 4 letters or
less. We'd like to be able to match words or syllables of any length,
without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the B<quantifier> metacharacters C<?>,
C<*>, C<+>, and C<{}> were created for.
They allow us to determine the
number of repeats of a portion of a regexp we consider to be a
match. Quantifiers are put immediately after the character, character
class, or grouping that we want to specify. They have the following
meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
                               # this produces $1 and the other does not.

    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa

For all of these quantifiers, perl will try to match as much of the
string as possible, while still allowing the regexp to succeed. Thus
with C</a?.../>, perl will first try to match the regexp with the C<a>
present; if that fails, perl will try to match the regexp without the
C<a> present. For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

This is what we might expect: the match finds the only C<cat> in the
string and locks onto it.
Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

One might initially guess that perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
this example, that means matching the C<at> sequence against the final
C<at> in the string. The other important principle illustrated here is that
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps. Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string. Quantifiers that
grab as much of the string as possible are called B<maximal match> or
B<greedy> quantifiers.

When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match.
The next leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match. And so on, until all the regexp elements are
satisfied.

=back

As we have seen above, Principle 0 overrides the others - the regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.

Here is an example of these principles in action:

    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
                              # $1 = 'The programming republic of Pe'
                              # $2 = 'r'
                              # $3 = 'l'

This regexp matches at the earliest string position, C<'T'>. One
might think that C<e>, being leftmost in the alternation, would be
matched, but C<r> produces the longest string in the first quantifier.

    $x =~ /(m{1,2})(.*)$/;  # matches,
                            # $1 = 'mm'
                            # $2 = 'ing republic of Perl'

Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.

    $x =~ /.*(m{1,2})(.*)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ing republic of Perl'

Here, the regexp matches at the start of the string. The first
quantifier C<.*> grabs as much as possible, leaving just a single
C<'m'> for the second quantifier C<m{1,2}>.

    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'

Here, C<.?> eats its maximal one character at the earliest possible
position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
the opportunity to match both C<m>'s. Finally,

    "aXXXb" =~ /(X*)/; # matches with $1 = ''

because it can match zero copies of C<'X'> at the beginning of the
string. If you definitely want to match at least one C<'X'>, use
C<X+>, not C<X*>.

Sometimes greed is not good.
At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece. For
this purpose, Larry Wall created the S<B<minimal match> > or
B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
the usual quantifiers with a C<?> appended to them. They have the
following meanings:

=over 4

=item *

C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.

=item *

C<a*?> = match 'a' 0 or more times, i.e., any number of times,
but as few times as possible

=item *

C<a+?> = match 'a' 1 or more times, i.e., at least once, but
as few times as possible

=item *

C<a{n,m}?> = match at least C<n> times, not more than C<m>
times, as few times as possible

=item *

C<a{n,}?> = match at least C<n> times, but as few times as
possible

=item *

C<a{n}?> = match exactly C<n> times. Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.

=back

Let's look at the example above, but with minimal quantifiers:

    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string C<^>
and the alternation to match is C<Th>, with the alternation C<e|r>
matching C<e>. The second quantifier C<.*> is free to gobble up the
rest of the string.

    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first
C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
matches just one C<'m'>.
Although the second quantifier C<.*?> would
prefer to match no characters, it is constrained by the end-of-string
anchor C<$> to match the rest of the string.

    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
                                  # $1 = 'The progra'
                                  # $2 = 'm'
                                  # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier C<.*?>
to match the empty string, because it is not constrained by a C<^>
anchor to match the beginning of the word. Principle 0 applies here,
however. Because it is possible for the whole regexp to match at the
start of the string, it I<will> match at the start of the string. Thus
the first quantifier has to match everything up to the first C<m>. The
second minimal quantifier matches just one C<m> and the third
quantifier matches the rest of the string.

    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
                                 # $1 = 'a'
                                 # $2 = 'mm'
                                 # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier C<.??> can match
earliest at position C<'a'>, so it does. The second quantifier is
greedy, so it matches C<mm>, and the third matches the rest of the
string.

We can modify principle 3 above to take into account non-greedy
quantifiers:

=over 4

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
regexp to match. The next leftmost greedy (non-greedy) quantifier, if
any, will try to match as much (little) of the string remaining
available to it as possible, while still allowing the whole regexp to
match. And so on, until all the regexp elements are satisfied.

=back

Just like alternation, quantifiers are also susceptible to
backtracking.
Here is a step-by-step analysis of the example

    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

=over 4

=item 0

Start with the first letter in the string 't'.

=item 1

The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.

=item 2

'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.

=item 3

'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.

=item 4

Now we can match the 'a' and the 't'.

=item 5

Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.

=item 6

We are done!

=back

Most of the time, all this moving forward and backtracking happens
quickly and searching is fast. There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string. A typical structure that blows up in your face is of the form

    /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> length k and the second with length n-k, m repetitions
whose bits add up to length n, etc. In fact there are an exponential
number of ways to partition a string as a function of length. A
regexp may get lucky and match early in the process, but if there is
no match, perl will try I<every> possibility before giving up. So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s.
The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.

=head2 Building a regexp

At this point, we have all the basic regexp concepts covered, so let's
give a more involved example of a regular expression. We will build a
regexp that matches numbers.

The first task in building a regexp is to decide what we want to match
and what we want to exclude. In our case, we want to match both
integers and floating point numbers and we want to reject any string
that isn't a number.

The next task is to break the problem down into smaller problems that
are easily converted into a regexp.

The simplest case is integers. These consist of a sequence of digits,
with an optional sign in front. The digits we can represent with
C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
regexp is

    /[+-]?\d+/;  # matches integers

A floating point number potentially has a sign, an integral part, a
decimal point, a fractional part, and an exponent. One or more of these
parts is optional, so we need to check out the different
possibilities. Floating point numbers which are in proper form include
123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
front is completely optional and can be matched by C<[+-]?>. We can
see that if there is no exponent, floating point numbers must have a
decimal point, otherwise they are integers. We might be tempted to
model these with C<\d*\.\d*>, but this would also match just a single
decimal point, which is not a number. So the three cases of floating
point number sans exponent are

    /[+-]?\d+\./;  # 1., 321., etc.
    /[+-]?\.\d+/;  # .1, .234, etc.
    /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.
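
As a quick sanity check, the three cases can be tried against a few
sample strings. The sketch below is only an illustration: the C<^> and
C<$> anchors, the helper name C<is_plain_float>, and the list of test
strings are assumptions added here, not part of the patterns above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three exponent-free floating point forms. The ^ and $ anchors
# are an addition for this sketch so that the whole string must match.
my @forms = (
    qr/^[+-]?\d+\.$/,      # 1., 321., etc.
    qr/^[+-]?\.\d+$/,      # .1, .234, etc.
    qr/^[+-]?\d+\.\d+$/,   # 1.0, 30.56, etc.
);

# Hypothetical helper: true if $str matches any of the three forms.
sub is_plain_float {
    my ($str) = @_;
    for my $re (@forms) {
        return 1 if $str =~ $re;
    }
    return 0;
}

# Sample inputs chosen for illustration only.
for my $str ('123.', '0.345', '.34', '-1.5', '.', '42') {
    printf "%-6s %s\n", $str,
        is_plain_float($str) ? 'matches' : 'does not match';
}
```

The first four strings match one of the forms, while a lone decimal
point and the plain integer C<42> are rejected, which is the behavior
the three cases were designed to give.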

These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent. Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be 'decoupled' from the
mantissa. The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<e> or C<E>, followed by an integer. So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher. In complex situations like this, the C<//x> modifier for a
match is invaluable. It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning. Using it,
we can rewrite our 'extended' regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp?
The answer is to backslash it
S<C<'\ '> > or put it in a character class S<C<[ ]> >. The same thing
goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
a space between the sign and the mantissa/integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

or written in the compact form,

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp. To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program. This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that specifies
patterns.
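
To round off the exercise, here is a small sketch that runs the final
compact regexp over some sample strings. The helper name C<is_number>
and both lists of inputs are assumptions made up for this illustration;
only the regexp itself comes from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The final number regexp from this section, precompiled with qr//.
my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

# Hypothetical helper: true if the whole string looks like a number.
sub is_number {
    my ($str) = @_;
    return $str =~ $number ? 1 : 0;
}

# Sample inputs chosen for illustration only.
my @good = ('42', '-7', '123.', '0.345', '.34', '-1e6', '25.4E-72', '+ 3.0');
my @bad  = ('.', 'e6', '1.2.3', 'abc', '');

print "accepted: '$_'\n" for grep {  is_number($_) } @good, @bad;
print "rejected: '$_'\n" for grep { !is_number($_) } @good, @bad;
```

Every string in C<@good> is accepted and every string in C<@bad> is
rejected. Keeping a list of inputs that must fail next to the list that
must pass is a cheap way to check that a regexp rejects what it should
as well as matching what it should.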

=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches. Associated with the matching operator, we have discussed the
single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
extended C<//x> modifiers.

There are a few more things you might want to know about matching
operators. First, we pointed out earlier that variables in regexps are
substituted before the regexp is evaluated:

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

This will print any lines containing the word C<Seuss>. It is not as
efficient as it could be, however, because perl has to re-evaluate
C<$pattern> each time through the loop. If C<$pattern> won't be
changing over the lifetime of the script, we can add the C<//o>
modifier, which directs perl to only perform variable substitutions
once:

    #!/usr/bin/perl
    #    Improved simple_grep
    $regexp = shift;
    while (<>) {
        print if /$regexp/o;  # a good deal faster
    }

If you change the variable after the first substitution happens, perl
will ignore the change. If you don't want any substitutions at all, use the
special delimiter C<m''>:

    @pattern = ('Seuss');
    while (<>) {
        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
    }

C<m''> acts like single quotes on a regexp; all other C<m> delimiters
act like double quotes. If the regexp evaluates to the empty string,
the regexp in the I<last successful match> is used instead.
So we have

    "dog" =~ /d/;  # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before

The final two modifiers C<//g> and C<//c> concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the
C<pos()> function.

The use of C<//g> is shown in the following example. Suppose we have
a string that consists of words separated by spaces. If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C<//g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>. The current position in the string is
associated with the string, not the regexp. This means that different
strings have different positions and their respective positions can be
set or read independently.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp.
So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

Closely associated with the C<//g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C<//g> match left
off. C<\G> allows us to easily do context-sensitive matching:

    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing

The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
Currently, the C<\G> anchor is only fully supported when used to anchor
to the start of the pattern.

C<\G> is also invaluable in processing fixed length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>. In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA> > gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, which is not
what we want. The solution is to use C<\G> to anchor the match to the codon
alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

B<search and replace>

Regular expressions also play a big role in B<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped. With the C<s///> operator, the
matched variables C<$1>, C<$2>, etc. are immediately available for use
in the replacement expression, so we use C<$1> to replace the quoted
string with just what was quoted. With the global modifier, C<s///g>
will search and replace all occurrences of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # doesn't do it all:
                       # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # does it all:
                       # $x contains "I batted four for four"

If you prefer 'regex' over 'regexp' in this tutorial, you could use
the following program to replace it:

    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/go;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod

In C<simple_replace> we used the C<s///g> modifier to replace all
occurrences of the regexp on each line and the C<s///o> modifier to
compile the regexp only once. As with C<simple_grep>, both the
C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around
the replacement string and the evaluated result is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text.
This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used C<s'''>, then the regexp and replacement are treated as single
quoted strings and there are no substitutions. C<s///> in list context
returns the same thing as in scalar context, i.e., the number of
matches.

B<The split operator>

The B<C<split>> function can also optionally use a matching operator
C<m//> to split a string. C<split /regexp/, string, limit> splits
C<string> into a list of substrings and returns that list. The regexp
is used to match the character sequence that the C<string> is split
with respect to. The C<limit>, if present, constrains splitting into
no more than C<limit> number of strings. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the list produced contains the matched substrings from the
groupings as well.
For instance,

    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x;  # $dirs[0] = ''
                             # $dirs[1] = 'usr'
                             # $dirs[2] = 'bin'
                             # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'
                                # $parts[5] = '/'
                                # $parts[6] = 'perl'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

If you have read this far, congratulations! You now have all the basic
tools needed to use regular expressions to solve a wide range of text
processing problems. If this is your first time through the tutorial,
why not stop here and play around with regexps a while... S<Part 2>
concerns the more esoteric aspects of regular expressions and those
concepts certainly aren't needed right at the start.

=head1 Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If
matching regular expressions is analogous to a walk in the woods, then
the tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time. Most of the tools in part 2
are analogous to flare guns and satellite phones. They aren't used
too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric
capabilities of perl regexps. In Part 2, we will assume you are
comfortable with the basics and concentrate on the new features.

=head2 More on characters, strings, and character classes

There are a number of escape sequences and character classes that we
haven't covered yet.

There are several escape sequences that convert characters or strings
between upper and lower case.
C<\l> and C<\u> convert the next 1606character to lower or upper case, respectively: 1607 1608 $x = "perl"; 1609 $string =~ /\u$x/; # matches 'Perl' in $string 1610 $x = "M(rs?|s)\\."; # note the double backslash 1611 $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.', 1612 1613C<\L> and C<\U> converts a whole substring, delimited by C<\L> or 1614C<\U> and C<\E>, to lower or upper case: 1615 1616 $x = "This word is in lower case:\L SHOUT\E"; 1617 $x =~ /shout/; # matches 1618 $x = "I STILL KEYPUNCH CARDS FOR MY 360" 1619 $x =~ /\Ukeypunch/; # matches punch card string 1620 1621If there is no C<\E>, case is converted until the end of the 1622string. The regexps C<\L\u$word> or C<\u\L$word> convert the first 1623character of C<$word> to uppercase and the rest of the characters to 1624lowercase. 1625 1626Control characters can be escaped with C<\c>, so that a control-Z 1627character would be matched with C<\cZ>. The escape sequence 1628C<\Q>...C<\E> quotes, or protects most non-alphabetic characters. For 1629instance, 1630 1631 $x = "\QThat !^*&%~& cat!"; 1632 $x =~ /\Q!^*&%~&\E/; # check for rough language 1633 1634It does not protect C<$> or C<@>, so that variables can still be 1635substituted. 1636 1637With the advent of 5.6.0, perl regexps can handle more than just the 1638standard ASCII character set. Perl now supports B<Unicode>, a standard 1639for encoding the character sets from many of the world's written 1640languages. Unicode does this by allowing characters to be more than 1641one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters 1642are still encoded as one byte, but characters greater than C<chr(127)> 1643may be stored as two or more bytes. 1644 1645What does this mean for regexps? Well, regexp users don't need to know 1646much about perl's internal representation of strings. 
But they do need
to know 1) how to represent Unicode characters in a regexp and 2) when
a matching operation will treat the string to be searched as a
sequence of bytes (the old way) or as a sequence of Unicode characters
(the new way). The answer to 1) is that Unicode characters greater
than C<chr(127)> may be represented using the C<\x{hex}> notation,
with C<hex> a hexadecimal integer:

    /\x{263a}/;  # match a Unicode smiley face :)

Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: C<\x{ab}>. Note that this is different from C<\xab>,
which is just a hexadecimal byte with no Unicode significance.

B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed. (The only case where it matters is when your Perl script itself
is written in Unicode and encoded in UTF-8; then an explicit
C<use utf8> is needed.)

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the S<B<named character> > escape
sequence C<\N{name}>. C<name> is a name for the Unicode character, as
specified in the Unicode standard.
For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use

    use charnames ":full"; # use named chars with Unicode full names
    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;   # matches

One can also use short names or restrict names to a certain alphabet:

    use charnames ':full';
    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";

    use charnames ":short";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";

A list of full names is found in the file Names.txt in the
lib/perl5/5.X.X/unicore directory.

The answer to requirement 2), as of 5.6.0, is that if a regexp
contains Unicode characters, the string is searched as a sequence of
Unicode characters. Otherwise, the string is searched as a sequence of
bytes. If the string is being searched as a sequence of Unicode
characters, but matching a single byte is required, we can use the C<\C>
escape sequence. C<\C> is a character class akin to C<.> except that
it matches I<any> byte 0-255. So

    use charnames ":full"; # use named chars with Unicode full names
    $x = "a";
    $x =~ /\C/;  # matches 'a', eats one byte
    $x = "";
    $x =~ /\C/;  # doesn't match, no bytes to match
    $x = "\N{MERCURY}";  # two-byte Unicode character
    $x =~ /\C/;  # matches, but dangerous!

The last regexp matches, but is dangerous because the string
I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
character'. The C<\C> is best used for matching the binary data in strings
with binary data intermixed with Unicode characters.

Let us now discuss the rest of the character classes.
Just as with
Unicode characters, there are named Unicode character classes
represented by the C<\p{name}> escape sequence. Closely associated is
the C<\P{name}> character class, which is the negation of the
C<\p{name}> class. For example, to match lower and uppercase
characters,

    use charnames ":full"; # use named chars with Unicode full names
    $x = "BOB";
    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase

Here is the association between some Perl named classes and the
traditional Unicode classes:

    Perl class name  Unicode class name or regular expression

    IsAlpha          /^[LM]/
    IsAlnum          /^[LMN]/
    IsASCII          $code <= 127
    IsCntrl          /^C/
    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
    IsDigit          Nd
    IsGraph          /^([LMNPS]|Co)/
    IsLower          Ll
    IsPrint          /^([LMNPS]|Co|Zs)/
    IsPunct          /^P/
    IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
    IsUpper          /^L[ut]/
    IsWord           /^[LMN]/ || $code eq "005F"
    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/

You can also use the official Unicode class names with the C<\p> and
C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
letter, the braces can be dropped. For instance, C<\pM> is the
character class of Unicode 'marks', for example accent marks.
For the full list see L<perlunicode>.

Unicode has also been separated into various sets of characters
which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
For the full list see L<perlunicode>.
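
The property escapes described above can be tried out on a short mixed
string. What follows is only a minimal sketch, assuming a reasonably
modern perl; the variable names (C<$text>, C<@upper>, C<@greek>,
C<@nondig>) are made up for this illustration and the sample string is
arbitrary.

```perl
use strict;
use warnings;

# 'Perl', a space, GREEK CAPITAL LETTER SIGMA, GREEK SMALL LETTER SIGMA,
# a space, and two digits (the sample string is an arbitrary choice)
my $text = "Perl \x{3a3}\x{3c3} 42";

my @upper  = $text =~ /\p{Lu}/g;     # uppercase letters
my @greek  = $text =~ /\p{Greek}/g;  # characters in the Greek script
my @nondig = $text =~ /\P{Nd}/g;     # anything that is not a decimal digit

printf "%d uppercase, %d Greek, %d non-digits\n",
    scalar @upper, scalar @greek, scalar @nondig;
```

In list context with C<//g>, each match returns every character that
belongs to (or, for C<\P>, falls outside) the named class, so the
array lengths give simple per-class counts.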

C<\X> is an abbreviation for a character class sequence that includes
the Unicode 'combining character sequences'. A 'combining character
sequence' is a base character followed by any number of combining
characters. An example of a combining character is an accent. Using
the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
character sequence with base character C<A> and combining character
S<C<COMBINING RING> >, which translates in Danish to A with the circle
atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>,
i.e., a non-mark followed by zero or more marks.

For the full and latest information about Unicode see the latest
Unicode standard, or the Unicode Consortium's website http://www.unicode.org/

As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
is being used, then these classes are defined the same as their
corresponding perl Unicode classes: C<[:upper:]> is the same as
C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
character classes. To negate a POSIX class, put a C<^> in front of
the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>.
The Unicode and POSIX character classes can
be used just like C<\d>, with the exception that POSIX character
classes can only be used inside of a character class:

    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;      # match '=item',
                                # followed by a space and a digit
    use charnames ":full";
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit

Whew! That is all the rest of the characters and character classes.

=head2 Compiling and saving regular expressions

In Part 1 we discussed the C<//o> modifier, which compiles a regexp
just once. This suggests that a compiled regexp is some data structure
that can be stored once and used again and again. The regexp quote
C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
regexp and transforms the result into a form that can be assigned to a
variable:

    $reg = qr/foo+bar?/;  # reg contains a compiled regexp

Then C<$reg> can be used as a regexp:

    $x = "fooooba";
    $x =~ $reg;     # matches, just like /foo+bar?/
    $x =~ /$reg/;   # same thing, alternate form

C<$reg> can also be interpolated into a larger regexp:

    $x =~ /(abc)?$reg/;  # still matches

As with the matching operator, the regexp quote can use different
delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote
delimiters C<qr''> prevent any interpolation from taking place.

Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered. Using
pre-compiled regexps, the C<simple_grep> program can be expanded into a
program that matches multiple patterns:

    % cat > multi_grep
    #!/usr/bin/perl
    # multi_grep - match any of <number> regexps
    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    @compiled = map qr/$_/, @regexp;
    while ($line = <>) {
        foreach $pattern (@compiled) {
            if ($line =~ /$pattern/) {
                print $line;
                last;  # we matched, so move onto the next line
            }
        }
    }
    ^D

    % multi_grep 2 last for multi_grep
        $regexp[$_] = shift foreach (0..$number-1);
            foreach $pattern (@compiled) {
                    last;

Storing pre-compiled regexps in an array C<@compiled> allows us to
simply loop through the regexps without any recompilation, thus gaining
flexibility without sacrificing speed.

=head2 Embedding comments and modifiers in a regular expression

Starting with this section, we will be discussing Perl's set of
B<extended patterns>. These are extensions to the traditional regular
expression syntax that provide powerful new tools for pattern
matching. We have already seen extensions in the form of the minimal
matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The
rest of the extensions below have the form C<(?char...)>, where the
C<char> is a character that determines the type of extension.

The first extension is an embedded comment C<(?#text)>. This embeds a
comment into the regular expression without affecting its meaning. The
comment should not have any closing parentheses in the text. An
example is

    /(?# Match an integer:)[+-]?\d+/;

This style of commenting has been largely superseded by the raw,
freeform commenting that is allowed with the C<//x> modifier.

The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,

    /(?i)yes/;  # match 'yes' case insensitively
    /yes/i;     # same thing
    /(?x)(          # freeform version of an integer regexp
             [+-]?  # match an optional sign
             \d+    # match a sequence of digits
         )
    /x;

Embedded modifiers can have two important advantages over the usual
modifiers. Embedded modifiers allow a custom set of modifiers for
I<each> regexp pattern. This is great for matching an array of regexps
that must have different modifiers:

    $pattern[0] = '(?i)doctor';
    $pattern[1] = 'Johnson';
    ...
    while (<>) {
        foreach $patt (@pattern) {
            print if /$patt/;
        }
    }

The second advantage is that embedded modifiers only affect the regexp
inside the group the embedded modifier is contained in. So grouping
can be used to localize the modifier's effects:

    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.

Embedded modifiers can also turn off any modifiers already present
by using, e.g., C<(?-i)>. Modifiers can also be combined into
a single expression, e.g., C<(?s-i)> turns on single line mode and
turns off case insensitivity.

=head2 Non-capturing groupings

We noted in Part 1 that groupings C<()> had two distinct functions: 1)
group regexp elements together as a single unit, and 2) extract, or
capture, substrings that matched the regexp in the
grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the
regexp to be treated as a single unit, but don't extract substrings or
set matching variables C<$1>, etc. Both capturing and non-capturing
groupings are allowed to co-exist in the same regexp. Because there is
no extraction, non-capturing groupings are faster than capturing
groupings.
Non-capturing groupings are also handy for choosing exactly
which parts of a regexp are to be extracted to matching variables:

    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation:

    $x = '12a34b5';
    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')

Non-capturing groupings may also have embedded modifiers:
C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
case insensitively and turns off multi-line mode.

=head2 Looking ahead and looking behind

This section concerns the lookahead and lookbehind assertions. First,
a little background.

In Perl regular expressions, most regexp elements 'eat up' a certain
amount of string when they match. For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
sense that perl moves to the next character position in the string
after the match. There are some elements, however, that don't eat up
characters (advance the character position) if they match. The examples
we have seen so far are the anchors. The anchor C<^> matches the
beginning of the line, but doesn't eat any characters. Similarly, the
word boundary anchor C<\b> matches, e.g., if the character to the left
is a word character and the character to the right is a non-word
character, but it doesn't eat up any characters itself. Anchors are
examples of 'zero-width assertions'.
Zero-width, because they consume
no characters, and assertions, because they test some property of the
string. In the context of our walk in the woods analogy to regexp
matching, most regexp elements move us along a trail, but anchors have
us stop a moment and check our surroundings. If the local environment
checks out, we can proceed forward. But if the local environment
doesn't satisfy us, we must backtrack.

Checking the environment entails either looking ahead on the trail,
looking behind, or both. C<^> looks behind, to see that there are no
characters before. C<$> looks ahead, to see that there are no
characters after. C<\b> looks both ahead and behind, to see if the
characters on either side differ in their 'word'-ness.

The lookahead and lookbehind assertions are generalizations of the
anchor concept. Lookahead and lookbehind are zero-width assertions
that let us specify which characters we want to test for. The
lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are

    $x = "I catch the housecat 'Tom-cat' with catnip";
    $x =~ /cat(?=\s+)/;   # matches 'cat' in 'housecat'
    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
                                           # $catwords[0] = 'catch'
                                           # $catwords[1] = 'catnip'
    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
                              # middle of $x

Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
width, i.e., a fixed number of characters long. Thus
C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not.
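
Because both assertions are zero-width, a substitution can match an
empty position between two characters and insert text there without
consuming anything. The following is a small sketch of that idea, not
part of the examples above; the helper name C<commify> is hypothetical
and was chosen only for this illustration.

```perl
use strict;
use warnings;

# commify() is a hypothetical helper used only for this illustration.
sub commify {
    my ($number) = @_;
    # Match the empty position that has a digit behind it (fixed-width
    # lookbehind) and a whole number of three-digit groups ahead of it,
    # up to the end of the string (lookahead), and put a comma there.
    $number =~ s/(?<=\d)(?=(?:\d{3})+$)/,/g;
    return $number;
}

print commify("1234567"), "\n";
```

Note that the lookbehind C<< (?<=\d) >> is one character wide, which
satisfies the fixed-width restriction, while the lookahead is free to
contain a quantified group.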
The
negated versions of the lookahead and lookbehind assertions are
denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
They evaluate true if the regexps do I<not> match:

    $x = "foobar";
    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'

The C<\C> is unsupported in lookbehind, because the already
treacherous definition of C<\C> would become even more so
when going backwards.

=head2 Using independent subexpressions to prevent backtracking

The last few extended patterns in this tutorial are experimental as of
5.6.0. Play with them, use them in some code, but don't rely on them
just yet for production code.

S<B<Independent subexpressions> > are regular expressions, in the
context of a larger regular expression, that function independently of
the larger regular expression. That is, they consume as much or as
little of the string as they wish without regard for the ability of
the larger regexp to match. Independent subexpressions are represented
by C<< (?>regexp) >>. We can illustrate their behavior by first
considering an ordinary regexp:

    $x = "ab";
    $x =~ /a*ab/;  # matches

This obviously matches, but in the process of matching, the
subexpression C<a*> first grabbed the C<a>. Doing so, however,
wouldn't allow the whole regexp to match, so after backtracking, C<a*>
eventually gave back the C<a> and matched the empty string. Here, what
C<a*> matched was I<dependent> on what the rest of the regexp matched.

Contrast that with an independent subexpression:

    $x =~ /(?>a*)ab/;  # doesn't match!

The independent subexpression C<< (?>a*) >> doesn't care about the rest
of the regexp, so it sees an C<a> and grabs it.
Then the rest of the
regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
is no backtracking and the independent subexpression does not give
up its C<a>. Thus the match of the regexp as a whole fails. A similar
behavior occurs with completely independent regexps:

    $x = "ab";
    $x =~ /a*/g;   # matches, eats an 'a'
    $x =~ /\Gab/g; # doesn't match, no 'a' available

Here C<//g> and C<\G> create a 'tag team' handoff of the string from
one regexp to the other. Regexps with an independent subexpression are
much like this, with a handoff of the string to the independent
subexpression, and a handoff of the string back to the enclosing
regexp.

The ability of an independent subexpression to prevent backtracking
can be quite useful. Suppose we want to match a non-empty string
enclosed in parentheses up to two levels deep. Then the following
regexp matches:

    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;

The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis. The alternation is two-way, with
the first alternative C<[^()]+> matching a substring with no
parentheses and the second alternative C<\([^()]*\)> matching a
substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point.
This can be done by
enclosing the inner quantifier as an independent subexpression:

    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it. Then
match failures fail much more quickly.

=head2 Conditional expressions

A S<B<conditional expression> > is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition. There are two types of conditional expression:
C<(?(condition)yes-regexp)> and
C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true,
the C<yes-regexp> will be matched. If the C<condition> is false, the
C<yes-regexp> will be skipped and perl will move onto the next regexp
element. The second form is like an S<C<'if () {} else {}'> > statement
in Perl. If the C<condition> is true, the C<yes-regexp> will be
matched, otherwise the C<no-regexp> will be matched.

The C<condition> can have two forms. The first form is simply an
integer in parentheses C<(integer)>. It is true if the corresponding
backreference C<\integer> matched earlier in the regexp. The second
form is a bare zero width assertion C<(?...)>, either a
lookahead, a lookbehind, or a code assertion (discussed in the next
section).

The integer form of the C<condition> allows us to choose, with more
flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or
C<"$x$y$y$x">:

    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu

The lookbehind C<condition> allows, along with backreferences,
an earlier part of the match to influence a later part of the
match. For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<C>. Note that the form is
C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
lookahead, lookbehind or code assertions, the parentheses around the
conditional are not needed.

=head2 A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions.
S<B<Code evaluation> > expressions turn that around by allowing
arbitrary Perl code to be a part of a regexp. A code evaluation
expression is denoted C<(?{code})>, with C<code> a string of Perl
statements.

Code expressions are zero-width assertions, and the value they return
depends on their environment. There are two possibilities: either the
code expression is used as a conditional in a conditional expression
C<(?(condition)...)>, or it is not. If the code expression is a
conditional, the code is evaluated and the result (i.e., the result of
the last statement) is used to determine truth or falsehood. If the
code expression is not used as a conditional, the assertion always
evaluates true and the result is put into the special variable
C<$^R>. The variable C<$^R> can then be used in code expressions later
in the regexp. Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'

Pay careful attention to the next example:

    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?

At first glance, you'd think that it shouldn't print, because obviously
the C<ddd> isn't going to match the target string. But look at this
example:

    $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
                                           # but _does_ print

Hmm. What happened here? If you've been following along, you know that
the above pattern should be effectively the same as the last one --
enclosing the d in a character class isn't going to change what it
matches. So why does the first not print while the second one does?

The answer lies in the optimizations the REx engine makes. In the first
case, all the engine sees are plain old characters (aside from the
C<?{}> construct). It's smart enough to realize that the string 'ddd'
doesn't occur in our target string before actually running the pattern
through. But in the second case, we've tricked it into thinking that our
pattern is more complicated than it is. It takes a look, sees our
character class, and decides that it will have to actually run the
pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't
have a match.

To take a closer look at how the engine does optimizations, see the
section L<"Pragmas and debugging"> below.

More fun with C<?{}>:

    $x =~ /(?{print "Hi Mom!";})/;         # matches,
                                           # prints 'Hi Mom!'
    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
                                           # prints '1'
    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
                                           # prints '1'

The bit of magic mentioned in the section title occurs when the regexp
backtracks in the process of searching for a match. If the regexp
backtracks over a code expression and if the variables used within are
localized using C<local>, the changes in the variables produced by the
code expression are undone!
Thus, if we wanted to count how many times
a character got matched inside a group, we could use, e.g.,

    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";

This prints

    'a' count is 2, $c variable is 'bob'

If we replace the S<C< (?{local $c = $c + 1;})> > with
S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
during backtracking, and we get

    'a' count is 4, $c variable is 'bob'

Note that only localized variable changes are undone. Other side
effects of code expression execution are permanent. Thus

    $x = "aaaa";
    $x =~ /(a(?{print "Yow\n";}))*aa/;

produces

    Yow
    Yow
    Yow
    Yow

The result C<$^R> is automatically localized, so that it will behave
properly in the presence of backtracking.

This example uses a code expression in a conditional to match the
article 'the' in either English or German:

    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (die|das|der)     # else, match 'die|das|der'
                     )
                    /xi;

Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
code expression, we don't need the extra parentheses around the
conditional.
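
The conditional above can be wrapped in a subroutine to check both
branches side by side. This is only a sketch under a few assumptions:
the subroutine name C<matches_article> is hypothetical, the pattern is
anchored with C<^> and C<$> (which the example above does not do), and
the language is kept in a package variable declared with C<our> so the
code expression can read it under C<use strict>.

```perl
use strict;
use warnings;

our $lang;    # package variable read inside the code expression

# matches_article() is a hypothetical helper for this illustration.
sub matches_article {
    my ($language, $text) = @_;
    local $lang = $language;    # restored when the sub returns
    return $text =~ /^(?(?{ $lang eq 'EN' })
                        the                # English: match 'the'
                      | (?:die|das|der)    # otherwise a German article
                     )$/xi ? 1 : 0;
}

for my $case (['EN', 'the'], ['EN', 'das'], ['DE', 'das'], ['DE', 'the']) {
    my ($language, $word) = @$case;
    printf "%s %-3s => %s\n", $language, $word,
        matches_article($language, $word) ? "match" : "no match";
}
```

With the anchors in place, each language accepts only its own
articles, which makes it easy to see which branch of the conditional
was taken.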

If you try to use code expressions with interpolating variables, perl
may surprise you:

    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
    /foo(?{ 1 })$bar/;   # compile error!
    /foo${pat}bar/;      # compile error!

    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
    /foo${pat}bar/;      # compiles ok

If a regexp has (1) code expressions and interpolating variables, or
(2) a variable that interpolates a code expression, perl treats the
regexp as an error. If the code expression is precompiled into a
variable, however, interpolating is ok. The question is, why is this
an error?

The reason is that variable interpolation and code expressions
together pose a security risk. The combination is dangerous because
many programmers who write search engines often take user input and
plug it directly into a regexp:

    $regexp = <>;        # read user-supplied regexp
    chomp $regexp;       # get rid of possible newline
    $text =~ /$regexp/;  # search $text for the $regexp

If the C<$regexp> variable contains a code expression, the user could
then execute arbitrary Perl code. For instance, some joker could
search for S<C<system('rm -rf *');> > to erase your files. In this
sense, the combination of interpolation and code expressions B<taints>
your regexp. So by default, using both interpolation and code
expressions in the same regexp is not allowed. If you're not
concerned about malicious users, it is possible to bypass this
security check by invoking S<C<use re 'eval'> >:

    use re 'eval';       # throw caution out the door
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ 1 })$bar/;   # compiles ok
    /foo${pat}bar/;      # compiles ok

Another form of code expression is the S<B<pattern code expression> >.
The pattern code expression is like a regular code expression, except
that the result of the code evaluation is treated as a regular
expression and matched immediately. A simple example is

    $length = 5;
    $char = 'a';
    $x = 'aaaaabb';
    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'

This final example contains both ordinary and pattern code
expressions. It detects if a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:

    $s0 = 0; $s1 = 1; # initial conditions
    $x = "1101010010001000001";
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
            (
               (??{'0' x $s0}) # match $s0 of '0'
               1               # and then a '1'
               (?{
                  $largest = $s0;   # largest seq so far
                  $s2 = $s1 + $s0;  # compute next term
                  $s0 = $s1;        # in Fibonacci sequence
                  $s1 = $s2;
                 })
            )+   # repeat as needed
          $      # that is all there is
         /x;
    print "Largest sequence matched was $largest\n";

This prints

    It is a Fibonacci sequence
    Largest sequence matched was 5

Ha! Try that with your garden variety regexp package...

Note that the variables C<$s0> and C<$s1> are not substituted when the
regexp is compiled, as happens for ordinary variables outside a code
expression. Rather, the code expressions are evaluated when perl
encounters them during the search for a match.

The regexp without the C<//x> modifier is

    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;

and is a great start on an Obfuscated Perl entry :-) When working with
code and conditional expressions, the extended form of regexps is
almost necessary in creating and debugging regexps.

=head2 Pragmas and debugging

Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl.
We have already encountered one pragma in
the previous section, S<C<use re 'eval';> >, that allows variable
interpolation and code expressions to coexist in a regexp. The other
pragmas are

    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/);  # @parts is now tainted

The C<taint> pragma causes any substrings from a match with a tainted
variable to be tainted as well. This is not normally the case, as
regexps are often used to extract the safe bits from a tainted
variable. Use C<taint> when you are not extracting safe bits, but are
performing some other processing. Both C<taint> and C<eval> pragmas
are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.

    use re 'debug';
    /^(.*)$/s;       # output debugging info

    use re 'debugcolor';
    /^(.*)$/s;       # output debugging info in living color

The global C<debug> and C<debugcolor> pragmas allow one to get
detailed debugging info about regexp compilation and
execution. C<debugcolor> is the same as C<debug>, except the debugging
information is displayed in color on terminals that can display
termcap color sequences. Here is some example output:

    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
    Compiling REx `a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)
    floating `bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx `a*b+c' against `abc'...
    Found floating substr `bc' at offset 1...
    Guessed: match at offset 0
    Matching REx `a*b+c' against `abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: `a*b+c'

If you have gotten this far into the tutorial, you can probably guess
what the different parts of the debugging output tell you. The first
part

    Compiling REx `a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)

describes the compilation stage. C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, go to line 4,
i.e., C<PLUS(7)>. The middle lines describe some heuristics and
optimizations performed before a match:

    floating `bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx `a*b+c' against `abc'...
    Found floating substr `bc' at offset 1...
    Guessed: match at offset 0

Then the match is executed and the remaining lines describe the
process:

    Matching REx `a*b+c' against `abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: `a*b+c'

Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched. The S<C<< | 1: STAR >> > says that perl is at line number 1
in the compilation list above. See
L<perldebguts/"Debugging regular expressions"> for much more detail.

An alternative method of debugging regexps is to embed C<print>
statements within the regexp.
This provides a blow-by-blow account of
the backtracking in an alternation:

    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                     t(?{print "t1\n";})
                     h(?{print "h1\n";})
                     i(?{print "i1\n";})
                     s(?{print "s1\n";})
                         |
                     t(?{print "t2\n";})
                     h(?{print "h2\n";})
                     a(?{print "a2\n";})
                     t(?{print "t2\n";})
                     (?{print "Done at position ", pos, "\n";})
                    @x;

prints

    Start at position 0
    t1
    h1
    t2
    h2
    a2
    t2
    Done at position 4

=head1 BUGS

Code expressions, conditional expressions, and independent expressions
are B<experimental>. Don't use them in production code. Yet.

=head1 SEE ALSO

This is just a tutorial. For the full story on perl regular
expressions, see the L<perlre> regular expressions reference page.

For more information on the matching C<m//> and substitution C<s///>
operators, see L<perlop/"Regexp Quote-Like Operators">. For
information on the C<split> operation, see L<perlfunc/split>.

For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.

The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.

=cut