=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples.
The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme.
The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., the double quote C<"> is used as a delimiter,
the forward slash C<'/'> becomes an ordinary character and can be used
in a regexp without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string.
The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called B<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;   # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell.
If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)   # matches
    "1000\n2000" =~ /0\n20/   # matches
    "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
    "cat" =~ /\143\x61\x74/   # matches, but a weird way to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >> > loops over all the lines in
all the files.
For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    ""       =~ /^$/;          # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. Character
classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;              # matches 'cat'
    /[bcr]at/;          # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/; # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;   # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;     # match 'yes' in a case-insensitive way
                        # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>.
The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are C<-]\^$>. C<]>
is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.
Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.

An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;     # matches cat in 'housecat'
    $x =~ /\bcat/;   # matches cat in 'catenates'
    $x =~ /cat\b/;   # matches cat in 'housecat'
    $x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    ""   =~ /./;     # doesn't match; it needs a char
    ""   =~ /^.$/;   # doesn't match; it needs a char
    "\n" =~ /^.$/;   # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;   # matches
    "a\n" =~ /^.$/;  # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line.
Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful. If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, perl will try to match the
regexp at the earliest possible point in the string.
At each
character position, perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, perl will then try the
next alternative, C<cat>. If C<cat> doesn't match either, then the
match fails and perl moves to the next position in the string. Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice. It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The B<grouping> metacharacters C<()> solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses.
Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

The process of trying one alternative, seeing if it matches, and
moving on to the next alternative if it doesn't, is called
B<backtracking>. The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some which are dead ends. When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail.
If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what perl does when it tries to match the regexp

    "abcde" =~ /(abd|abc)(df|d|de)/;

=over 4

=item 0

Start with the first letter in the string 'a'.

=item 1

Try the first alternative in the first group 'abd'.

=item 2

Match 'a' followed by 'b'. So far so good.

=item 3

'd' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.

=item 4

Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.

=item 5

Move on to the second group and pick the first alternative
'df'.

=item 6

Match the 'd'.

=item 7

'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.

=item 8

'd' matches. The second grouping is satisfied, so set $2 to
'd'.

=item 9

We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".

=back

There are a couple of things to note about this analysis. First, the
third alternative in the second group 'de' also allows a match, but we
stopped before we got to it - at a given character position, leftmost
wins. Second, we were able to get a match at the first character
position of the string 'a'. If there were no matches at the first
position, perl would move to the second character position 'b' and
attempt the match all over again.
Only when all possible paths at all
possible character positions have been exhausted does perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.

Even with all this work, regexp matching happens remarkably fast. To
speed things up, during the compilation stage, perl compiles the regexp
into a compact sequence of opcodes that can often fit inside a
processor cache. When the code is executed, these opcodes can then run
at full throttle and search very quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched. This is very useful to find out what matched and for
text processing in general. For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, etc. They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
value. In list context, however, it returns the list of matched values
C<($1,$2,$3)>. So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'.
For convenience, perl sets C<$+> to the highest numbered C<$1>, C<$2>,
... that got assigned.

Closely associated with the matching variables C<$1>, C<$2>, ...
are
the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
matching variables that can be used I<inside> a regexp. This is a
really nice feature - what matches later in a regexp can depend on
what matched earlier in the regexp. Suppose we wanted to look
for doubled words in text, like 'the the'. The following regexp finds
all 3-letter doubles with a space in between:

    /(\w\w\w)\s\1/;

The grouping assigns a value to \1, so that the same 3 letter sequence
is used for both parts. Here are some words with repeated parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc. and uses C<\1> to look for
a repeat. Although C<$1> and C<\1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
so may lead to surprising and/or undefined results.

In addition to what was matched, Perl 5.6.0 also provides the
positions of what was matched with the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)

Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string.
If you use them, perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match. An example:

    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'

In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'. It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, and C< $& > slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
If you need them, use C<@-> and C<@+> instead:

    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )

=head2 Matching repetitions

The examples in the previous section display an annoying weakness. We
were only matching 3-letter words, or syllables of 4 letters or
less. We'd like to be able to match words or syllables of any length,
without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the B<quantifier> metacharacters C<?>,
C<*>, C<+>, and C<{}> were created for. They allow us to determine the
number of repeats of a portion of a regexp we consider to be a
match. Quantifiers are put immediately after the character, character
class, or grouping that we want to specify.
They have the following
meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
                               # this produces $1 and the other does not.

    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa

For all of these quantifiers, perl will try to match as much of the
string as possible, while still allowing the regexp to succeed. Thus
with C</a?.../>, perl will first try to match the regexp with the C<a>
present; if that fails, perl will try to match the regexp without the
C<a> present. For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

This is what we might expect: the match finds the only C<cat> in the
string and locks onto it.
Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

One might initially guess that perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
this example, that means matching the C<at> sequence against the final
C<at> in the string. The other important principle illustrated here is that
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps. Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string. Quantifiers that
grab as much of the string as possible are called B<maximal match> or
B<greedy> quantifiers.

When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match.
The next
leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match. And so on, until all the regexp elements are
satisfied.

=back

As we have seen above, Principle 0 overrides the others - the regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.

Here is an example of these principles in action:

    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
                              # $1 = 'The programming republic of Pe'
                              # $2 = 'r'
                              # $3 = 'l'

This regexp matches at the earliest string position, C<'T'>. One
might think that C<e>, being leftmost in the alternation, would be
matched, but C<r> produces the longest string in the first quantifier.

    $x =~ /(m{1,2})(.*)$/;  # matches,
                            # $1 = 'mm'
                            # $2 = 'ing republic of Perl'

Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.

    $x =~ /.*(m{1,2})(.*)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ing republic of Perl'

Here, the regexp matches at the start of the string. The first
quantifier C<.*> grabs as much as possible, leaving just a single
C<'m'> for the second quantifier C<m{1,2}>.

    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'

Here, C<.?> eats its maximal one character at the earliest possible
position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
the opportunity to match both C<m>'s. Finally,

    "aXXXb" =~ /(X*)/; # matches with $1 = ''

because it can match zero copies of C<'X'> at the beginning of the
string. If you definitely want to match at least one C<'X'>, use
C<X+>, not C<X*>.

Sometimes greed is not good.
At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece. For
this purpose, Larry Wall created the S<B<minimal match> > or
B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
the usual quantifiers with a C<?> appended to them. They have the
following meanings:

=over 4

=item *

C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.

=item *

C<a*?> = match 'a' 0 or more times, i.e., any number of times,
but as few times as possible

=item *

C<a+?> = match 'a' 1 or more times, i.e., at least once, but
as few times as possible

=item *

C<a{n,m}?> = match at least C<n> times, not more than C<m>
times, as few times as possible

=item *

C<a{n,}?> = match at least C<n> times, but as few times as
possible

=item *

C<a{n}?> = match exactly C<n> times. Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.

=back

Let's look at the example above, but with minimal quantifiers:

    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string C<^>
and the alternation to match is C<Th>, with the alternation C<e|r>
matching C<e>. The second quantifier C<.*> is free to gobble up the
rest of the string.

    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first
C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
matches just one C<'m'>. Although the second quantifier C<.*?> would
prefer to match no characters, it is constrained by the end-of-string
anchor C<$> to match the rest of the string.
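
The difference between the greedy and minimal forms is easiest to see
side by side. The following is a minimal sketch, assuming a perl recent
enough for C<my> variables and C<use warnings>; the array names
C<@greedy> and C<@minimal> are illustrative only and do not come from
the examples above.

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# Greedy: .+ takes as much as it can, so the alternation ends up
# matching the last 'r' in the string.
my @greedy = $x =~ /^(.+)(e|r)(.*)$/;

# Minimal: .+? takes as little as it can, so the alternation gets
# the first 'e' in the string.
my @minimal = $x =~ /^(.+?)(e|r)(.*)$/;

print "greedy:  '$greedy[0]' | '$greedy[1]' | '$greedy[2]'\n";
print "minimal: '$minimal[0]' | '$minimal[1]' | '$minimal[2]'\n";
```

Capturing the groups in list context, rather than reading C<$1>,
C<$2>, ... afterwards, keeps the results of the two matches from
overwriting each other.
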

    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
                                  # $1 = 'The progra'
                                  # $2 = 'm'
                                  # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier C<.*?>
to match the empty string, because it is not constrained by a C<^>
anchor to match the beginning of the string. Principle 0 applies here,
however. Because it is possible for the whole regexp to match at the
start of the string, it I<will> match at the start of the string. Thus
the first quantifier has to match everything up to the first C<m>. The
second minimal quantifier matches just one C<m> and the third
quantifier matches the rest of the string.

    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
                                 # $1 = 'a'
                                 # $2 = 'mm'
                                 # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier C<.??> can match
earliest at position C<'a'>, so it does. The second quantifier is
greedy, so it matches C<mm>, and the third matches the rest of the
string.

We can modify principle 3 above to take into account non-greedy
quantifiers:

=over 4

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
regexp to match. The next leftmost greedy (non-greedy) quantifier, if
any, will try to match as much (little) of the string remaining
available to it as possible, while still allowing the whole regexp to
match. And so on, until all the regexp elements are satisfied.

=back

Just like alternation, quantifiers are also susceptible to
backtracking.
Here is a step-by-step analysis of the example

    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

=over 4

=item 0

Start with the first letter in the string 't'.

=item 1

The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.

=item 2

'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.

=item 3

'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.

=item 4

Now we can match the 'a' and the 't'.

=item 5

Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.

=item 6

We are done!

=back

Most of the time, all this moving forward and backtracking happens
quickly and searching is fast. There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string. A typical structure that blows up in your face is of the form

    /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> length k and the second with length n-k, m repetitions
whose bits add up to length n, etc. In fact there are an exponential
number of ways to partition a string as a function of length. A
regexp may get lucky and match early in the process, but if there is
no match, perl will try I<every> possibility before giving up. So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s.
The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.

=head2 Building a regexp

At this point, we have all the basic regexp concepts covered, so let's
give a more involved example of a regular expression. We will build a
regexp that matches numbers.

The first task in building a regexp is to decide what we want to match
and what we want to exclude. In our case, we want to match both
integers and floating point numbers and we want to reject any string
that isn't a number.

The next task is to break the problem down into smaller problems that
are easily converted into a regexp.

The simplest case is integers. These consist of a sequence of digits,
with an optional sign in front. The digits we can represent with
C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
regexp is

    /[+-]?\d+/;  # matches integers

A floating point number potentially has a sign, an integral part, a
decimal point, a fractional part, and an exponent. One or more of these
parts is optional, so we need to check out the different
possibilities. Floating point numbers which are in proper form include
123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
front is completely optional and can be matched by C<[+-]?>. We can
see that if there is no exponent, floating point numbers must have a
decimal point, otherwise they are integers. We might be tempted to
model these with C<\d*\.\d*>, but this would also match just a single
decimal point, which is not a number. So the three cases of floating
point number sans exponent are

    /[+-]?\d+\./;      # 1., 321., etc.
    /[+-]?\.\d+/;      # .1, .234, etc.
    /[+-]?\d+\.\d+/;   # 1.0, 30.56, etc.

These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent. Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be 'decoupled' from the
mantissa. The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<e> or C<E>, followed by an integer. So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher. In complex situations like this, the C<//x> modifier for a
match is invaluable. It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning. Using it,
we can rewrite our 'extended' regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp?
The answer is to backslash it
S<C<'\ '> > or put it in a character class S<C<[ ]> >. The same thing
goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
a space between the sign and the mantissa/integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

or written in the compact form,

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp. To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program. This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that specifies
patterns.
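
As a quick check of the finished regexp, it can be run against the
sample numbers mentioned earlier in this section together with a few
strings that should be rejected. This is only a sketch: the helper name
C<is_number> and the list of test strings are assumptions made for
illustration, and the regexp is stored with C<qr//> so that it is
written only once.

```perl
use strict;
use warnings;

# The compact form of the final regexp from this section.
my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

# is_number is a hypothetical helper used only in this sketch:
# it returns 1 if the whole string looks like a number, 0 otherwise.
sub is_number {
    my ($string) = @_;
    return $string =~ $number ? 1 : 0;
}

for my $candidate ('123.', '0.345', '.34', '-1e6', '25.4E-72', '42',
                   '.', 'abc', '1.2.3') {
    my $verdict = is_number($candidate) ? 'number' : 'not a number';
    print "'$candidate' is $verdict\n";
}
```

The first six strings are reported as numbers and the last three are
rejected. The lone decimal point is the case that the tempting but
wrong C<\d*\.\d*> would have let through.
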

=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches. Associated with the matching operator, we have discussed the
single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
extended C<//x> modifiers.

There are a few more things you might want to know about matching
operators. First, we pointed out earlier that variables in regexps are
substituted before the regexp is evaluated:

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

This will print any lines containing the word C<Seuss>. It is not as
efficient as it could be, however, because perl has to re-evaluate
C<$pattern> each time through the loop. If C<$pattern> won't be
changing over the lifetime of the script, we can add the C<//o>
modifier, which directs perl to only perform variable substitutions
once:

    #!/usr/bin/perl
    #    Improved simple_grep
    $regexp = shift;
    while (<>) {
        print if /$regexp/o;  # a good deal faster
    }

If you change C<$pattern> after the first substitution happens, perl
will ignore it. If you don't want any substitutions at all, use the
special delimiter C<m''>:

    $pattern = 'Seuss';
    while (<>) {
        print if m'$pattern';  # matches '$pattern', not 'Seuss'
    }

C<m''> acts like single quotes on a regexp; all other C<m> delimiters
act like double quotes. If the regexp evaluates to the empty string,
the regexp in the I<last successful match> is used instead.
So we have

    "dog" =~ /d/;  # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before

The final two modifiers C<//g> and C<//c> concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the
C<pos()> function.

The use of C<//g> is shown in the following example. Suppose we have
a string that consists of words separated by spaces. If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C<//g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>. The current position in the string is
associated with the string, not the regexp. This means that different
strings have different positions and their respective positions can be
set or read independently.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp.
So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

Closely associated with the C<//g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C<//g> match left
off. C<\G> allows us to easily do context-sensitive matching:

    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing

The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.

C<\G> is also invaluable in processing fixed length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>. In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA> > gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, which is
not what we want. The solution is to use C<\G> to anchor the match to
the codon alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

B<search and replace>

Regular expressions also play a big role in B<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped. With the C<s///> operator, the
matched variables C<$1>, C<$2>, etc.
are immediately available for use
in the replacement expression, so we use C<$1> to replace the quoted
string with just what was quoted. With the global modifier, C<s///g>
will search and replace all occurrences of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # doesn't do it all:
                       # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # does it all:
                       # $x contains "I batted four for four"

If you prefer 'regex' over 'regexp' in this tutorial, you could use
the following program to replace it:

    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/go;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod

In C<simple_replace> we used the C<s///g> modifier to replace all
occurrences of the regexp on each line and the C<s///o> modifier to
compile the regexp only once. As with C<simple_grep>, both the
C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around
the replacement string and the evaluated result is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text.
This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints (the order among characters with equal counts may vary)

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used C<s'''>, then the regexp and replacement are treated as single
quoted strings and there are no substitutions. C<s///> in list context
returns the same thing as in scalar context, i.e., the number of
matches.

B<The split operator>

The B<C<split> > function can also optionally use a matching operator
C<m//> to split a string. C<split /regexp/, string, limit> splits
C<string> into a list of substrings and returns that list. The regexp
is used to match the character sequence that the C<string> is split
with respect to. The C<limit>, if present, constrains splitting into
no more than C<limit> number of strings. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the list produced contains the matched substrings from the
groupings as well.
For instance,

    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x;  # $dirs[0] = ''
                             # $dirs[1] = 'usr'
                             # $dirs[2] = 'bin'
                             # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'
                                # $parts[5] = '/'
                                # $parts[6] = 'perl'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

If you have read this far, congratulations! You now have all the basic
tools needed to use regular expressions to solve a wide range of text
processing problems. If this is your first time through the tutorial,
why not stop here and play around with regexps a while... S<Part 2>
concerns the more esoteric aspects of regular expressions and those
concepts certainly aren't needed right at the start.

=head1 Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If
matching regular expressions is analogous to a walk in the woods, then
the tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time. Most of the tools in part 2
are analogous to flare guns and satellite phones. They aren't used
too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric
capabilities of perl regexps. In Part 2, we will assume you are
comfortable with the basics and concentrate on the new features.

=head2 More on characters, strings, and character classes

There are a number of escape sequences and character classes that we
haven't covered yet.

There are several escape sequences that convert characters or strings
between upper and lower case.
C<\l> and C<\u> convert the next
character to lower or upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/;  # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'

C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
C<\U> and C<\E>, to lower or upper case:

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string

If there is no C<\E>, case is converted until the end of the
string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
character of C<$word> to uppercase and the rest of the characters to
lowercase.

Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>. The escape sequence
C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For
instance,

    $x = "\QThat !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language

It does not protect C<$> or C<@>, so that variables can still be
substituted.

With the advent of 5.6.0, perl regexps can handle more than just the
standard ASCII character set. Perl now supports B<Unicode>, a standard
for encoding the character sets from many of the world's written
languages. Unicode does this by allowing characters to be more than
one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters
are still encoded as one byte, but characters greater than C<chr(127)>
may be stored as two or more bytes.

What does this mean for regexps? Well, regexp users don't need to know
much about perl's internal representation of strings.
But they do need
to know 1) how to represent Unicode characters in a regexp and 2) when
a matching operation will treat the string to be searched as a
sequence of bytes (the old way) or as a sequence of Unicode characters
(the new way). The answer to 1) is that Unicode characters greater
than C<chr(127)> may be represented using the C<\x{hex}> notation,
with C<hex> a hexadecimal integer:

    use utf8;    # We will be doing Unicode processing
    /\x{263a}/;  # match a Unicode smiley face :)

Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: C<\x{ab}>. Note that this is different from C<\xab>,
which is just a hexadecimal byte with no Unicode
significance.

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the S<B<named character> > escape
sequence C<\N{name}>. C<name> is a name for the Unicode character, as
specified in the Unicode standard. For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use

    use utf8;              # We will be doing Unicode processing
    use charnames ":full"; # use named chars with Unicode full names
    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;   # matches

One can also use short names or restrict names to a certain alphabet:

    use utf8;              # We will be doing Unicode processing

    use charnames ':full';
    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";

    use charnames ":short";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";

A list of full names is found in the file Names.txt in the
lib/perl5/5.6.0/unicode directory.
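
As a quick check that the two notations name the same character, here
is a minimal sketch. It assumes a reasonably recent perl rather than
5.6.0 itself, and the variable names are made up for illustration. It
spells one character both as C<\x{hex}> and as C<\N{name}>, uses the
named form in a regexp, and then looks the official name up from the
code point with C<charnames::viacode>:

```perl
use strict;
use warnings;
use charnames ':full';   # enables \N{...} and charnames::viacode

# The same character written two ways (illustrative variable names).
my $by_hex  = "\x{263a}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = $by_hex eq $by_name ? 1 : 0;

# Either spelling can be used inside a regexp.
my $text  = "have a nice day \x{263a}";
my $found = $text =~ /\N{WHITE SMILING FACE}/ ? 1 : 0;

# Going the other way: from code point to official Unicode name.
my $name = charnames::viacode(0x263a);

print "same=$same found=$found name=$name\n";
```

Only ASCII is printed here, so no output encoding layer is needed;
printing the character itself would normally call for one.
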

The answer to requirement 2), as of 5.6.0, is that if a regexp
contains Unicode characters, the string is searched as a sequence of
Unicode characters. Otherwise, the string is searched as a sequence of
bytes. If the string is being searched as a sequence of Unicode
characters, but matching a single byte is required, we can use the C<\C>
escape sequence. C<\C> is a character class akin to C<.> except that
it matches I<any> byte 0-255. So

    use utf8;              # We will be doing Unicode processing
    use charnames ":full"; # use named chars with Unicode full names
    $x = "a";
    $x =~ /\C/;  # matches 'a', eats one byte
    $x = "";
    $x =~ /\C/;  # doesn't match, no bytes to match
    $x = "\N{MERCURY}";  # multi-byte Unicode character
    $x =~ /\C/;  # matches, but dangerous!

The last regexp matches, but is dangerous because the string
I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
character'. C<\C> is best used for matching the binary data in strings
with binary data intermixed with Unicode characters.

Let us now discuss the rest of the character classes. Just as with
Unicode characters, there are named Unicode character classes
represented by the C<\p{name}> escape sequence. Closely associated is
the C<\P{name}> character class, which is the negation of the
C<\p{name}> class.
For example, to match lower and uppercase
characters,

    use utf8;              # We will be doing Unicode processing
    use charnames ":full"; # use named chars with Unicode full names
    $x = "BOB";
    $x =~ /^\p{IsUpper}/;  # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;  # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;  # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;  # matches, char class sans lowercase

Here is the association between some Perl named classes and the
traditional Unicode classes:

    Perl class name  Unicode class name or regular expression

    IsAlpha          /^[LM]/
    IsAlnum          /^[LMN]/
    IsASCII          $code <= 127
    IsCntrl          /^C/
    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
    IsDigit          Nd
    IsGraph          /^([LMNPS]|Co)/
    IsLower          Ll
    IsPrint          /^([LMNPS]|Co|Zs)/
    IsPunct          /^P/
    IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/)
    IsUpper          /^L[ut]/
    IsWord           /^[LMN]/ || $code eq "005F"
    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/

You can also use the official Unicode class names with C<\p> and
C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
letter, the braces can be dropped. For instance, C<\pM> is the
character class of Unicode 'marks'.

C<\X> is an abbreviation for a character class sequence that includes
the Unicode 'combining character sequences'. A 'combining character
sequence' is a base character followed by any number of combining
characters. An example of a combining character is an accent.
Using
the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
character sequence with base character C<A> and combining character
S<C<COMBINING RING> >, which translates in Danish to A with the circle
atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>,
i.e., a non-mark followed by zero or more marks.

As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
is being used, then these classes are defined the same as their
corresponding perl Unicode classes: C<[:upper:]> is the same as
C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
character classes. To negate a POSIX class, put a C<^> in front of
the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>. The Unicode character classes can be used
just like C<\d>, both inside and outside of bracketed character
classes. The POSIX classes can be used only inside a bracketed
character class:

    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;      # match '=item',
                                # followed by a space and a digit
    use utf8;
    use charnames ":full";
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit

Whew! That is all the rest of the characters and character classes.

=head2 Compiling and saving regular expressions

In Part 1 we discussed the C<//o> modifier, which compiles a regexp
just once.
This suggests that a compiled regexp is some data structure
that can be stored once and used again and again. The regexp quote
C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
regexp and transforms the result into a form that can be assigned to a
variable:

    $reg = qr/foo+bar?/;  # reg contains a compiled regexp

Then C<$reg> can be used as a regexp:

    $x = "fooooba";
    $x =~ $reg;     # matches, just like /foo+bar?/
    $x =~ /$reg/;   # same thing, alternate form

C<$reg> can also be interpolated into a larger regexp:

    $x =~ /(abc)?$reg/;  # still matches

As with the matching operator, the regexp quote can use different
delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote
delimiters C<qr''> prevent any interpolation from taking place.

Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered. Using
pre-compiled regexps, the C<simple_grep> program can be expanded into a
program that matches multiple patterns:

    % cat > multi_grep
    #!/usr/bin/perl
    # multi_grep - match any of <number> regexps
    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    @compiled = map qr/$_/, @regexp;
    while ($line = <>) {
        foreach $pattern (@compiled) {
            if ($line =~ /$pattern/) {
                print $line;
                last;  # we matched, so move onto the next line
            }
        }
    }
    ^D

    % multi_grep 2 last for multi_grep
        $regexp[$_] = shift foreach (0..$number-1);
            foreach $pattern (@compiled) {
                    last;

Storing pre-compiled regexps in an array C<@compiled> allows us to
simply loop through the regexps without any recompilation, thus gaining
flexibility without sacrificing speed.
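
The same idea can be sketched without reading files. The helper name
C<match_any> and the sample strings below are hypothetical, chosen
only to illustrate the technique: each pattern is compiled once with
C<qr//>, carries its own modifiers, and is then reused for every line.

```perl
use strict;
use warnings;

# Compile each pattern once; modifiers travel with the compiled regexp.
my @compiled = (qr/foo+bar?/, qr/^perl\b/i, qr/\d{3}/);

# Return the lines that match at least one pre-compiled pattern.
# (match_any is an illustrative helper, not part of any library.)
sub match_any {
    my ($patterns, @lines) = @_;
    my @hits;
    LINE: for my $line (@lines) {
        for my $re (@$patterns) {
            if ($line =~ $re) {
                push @hits, $line;
                next LINE;    # one match is enough for this line
            }
        }
    }
    return @hits;
}

my @hits = match_any(\@compiled,
                     "fooooba", "Perl rocks", "no match here", "call 555");
print "$_\n" for @hits;
```

Because C<qr/^perl\b/i> keeps its own C<//i>, the second line matches
even though the other patterns are case sensitive.
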

=head2 Embedding comments and modifiers in a regular expression

Starting with this section, we will be discussing Perl's set of
B<extended patterns>. These are extensions to the traditional regular
expression syntax that provide powerful new tools for pattern
matching. We have already seen extensions in the form of the minimal
matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The
rest of the extensions below have the form C<(?char...)>, where the
C<char> is a character that determines the type of extension.

The first extension is an embedded comment C<(?#text)>. This embeds a
comment into the regular expression without affecting its meaning. The
comment should not have any closing parentheses in the text. An
example is

    /(?# Match an integer:)[+-]?\d+/;

This style of commenting has been largely superseded by the raw,
freeform commenting that is allowed with the C<//x> modifier.

The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,

    /(?i)yes/;  # match 'yes' case insensitively
    /yes/i;     # same thing
    /(?x)(          # freeform version of an integer regexp
             [+-]?  # match an optional sign
             \d+    # match a sequence of digits
         )
    /x;

Embedded modifiers can have two important advantages over the usual
modifiers. Embedded modifiers allow a custom set of modifiers for
I<each> regexp pattern. This is great for matching an array of regexps
that must have different modifiers:

    $pattern[0] = '(?i)doctor';
    $pattern[1] = 'Johnson';
    ...
    while (<>) {
        foreach $patt (@pattern) {
            print if /$patt/;
        }
    }

The second advantage is that embedded modifiers only affect the regexp
inside the group the embedded modifier is contained in.
So grouping
can be used to localize the modifier's effects:

    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.

Embedded modifiers can also turn off any modifiers already present
by using, e.g., C<(?-i)>. Modifiers can also be combined into
a single expression, e.g., C<(?s-i)> turns on single line mode and
turns off case insensitivity.

=head2 Non-capturing groupings

We noted in Part 1 that groupings C<()> had two distinct functions: 1)
group regexp elements together as a single unit, and 2) extract, or
capture, substrings that matched the regexp in the
grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the
regexp to be treated as a single unit, but don't extract substrings or
set matching variables C<$1>, etc. Both capturing and non-capturing
groupings are allowed to co-exist in the same regexp. Because there is
no extraction, non-capturing groupings are faster than capturing
groupings. Non-capturing groupings are also handy for choosing exactly
which parts of a regexp are to be extracted to matching variables:

    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation:

    $x = '12a34b5';
    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')

Non-capturing groupings may also have embedded modifiers:
C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
case insensitively and turns off multi-line mode.
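
To see which variables actually get set, here is a small sketch that
runs the last number regexp and the two C<split> forms. The input
string C<-12.60e+3> and the variable names are assumptions made for
the illustration:

```perl
use strict;
use warnings;

# Whole number in $1, exponent (if any) in $2.  The inner (?:...)
# groups only group, so they do not use up capture variables.
my $number_re = qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

# In list context a successful match returns the captured substrings.
my ($whole, $exponent) = "-12.60e+3" =~ $number_re;

# split returns captured separators, but not non-captured ones.
my @with_seps    = split /(a|b)/,   '12a34b5';
my @without_seps = split /(?:a|b)/, '12a34b5';

print "whole=$whole exponent=$exponent\n";
print scalar(@with_seps), " fields vs ", scalar(@without_seps), " fields\n";
```
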

=head2 Looking ahead and looking behind

This section concerns the lookahead and lookbehind assertions. First,
a little background.

In Perl regular expressions, most regexp elements 'eat up' a certain
amount of string when they match. For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
sense that perl moves to the next character position in the string
after the match. There are some elements, however, that don't eat up
characters (advance the character position) if they match. The examples
we have seen so far are the anchors. The anchor C<^> matches the
beginning of the line, but doesn't eat any characters. Similarly, the
word boundary anchor C<\b> matches, e.g., if the character to the left
is a word character and the character to the right is a non-word
character, but it doesn't eat up any characters itself. Anchors are
examples of 'zero-width assertions'. Zero-width, because they consume
no characters, and assertions, because they test some property of the
string. In the context of our walk in the woods analogy to regexp
matching, most regexp elements move us along a trail, but anchors have
us stop a moment and check our surroundings. If the local environment
checks out, we can proceed forward. But if the local environment
doesn't satisfy us, we must backtrack.

Checking the environment entails either looking ahead on the trail,
looking behind, or both. C<^> looks behind, to see that there are no
characters before. C<$> looks ahead, to see that there are no
characters after. C<\b> looks both ahead and behind, to see if the
characters on either side differ in their 'word'-ness.

The lookahead and lookbehind assertions are generalizations of the
anchor concept. Lookahead and lookbehind are zero-width assertions
that let us specify which characters we want to test for.
The
lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are

    $x = "I catch the housecat 'Tom-cat' with catnip";
    $x =~ /cat(?=\s+)/;  # matches 'cat' in 'housecat'
    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
                                           # $catwords[0] = 'catch'
                                           # $catwords[1] = 'catnip'
    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
                              # middle of $x

Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
width, i.e., a fixed number of characters long. Thus
C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
negated versions of the lookahead and lookbehind assertions are
denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
They evaluate true if the regexps do I<not> match:

    $x = "foobar";
    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'

=head2 Using independent subexpressions to prevent backtracking

The last few extended patterns in this tutorial are experimental as of
5.6.0. Play with them, use them in some code, but don't rely on them
just yet for production code.

S<B<Independent subexpressions> > are regular expressions, in the
context of a larger regular expression, that function independently of
the larger regular expression. That is, they consume as much or as
little of the string as they wish without regard for the ability of
the larger regexp to match.
Independent subexpressions are represented
by C<< (?>regexp) >>. We can illustrate their behavior by first
considering an ordinary regexp:

    $x = "ab";
    $x =~ /a*ab/;  # matches

This obviously matches, but in the process of matching, the
subexpression C<a*> first grabbed the C<a>. Doing so, however,
wouldn't allow the whole regexp to match, so after backtracking, C<a*>
eventually gave back the C<a> and matched the empty string. Here, what
C<a*> matched was I<dependent> on what the rest of the regexp matched.

Contrast that with an independent subexpression:

    $x =~ /(?>a*)ab/;  # doesn't match!

The independent subexpression C<< (?>a*) >> doesn't care about the rest
of the regexp, so it sees an C<a> and grabs it. Then the rest of the
regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
is no backtracking and the independent subexpression does not give
up its C<a>. Thus the match of the regexp as a whole fails. A similar
behavior occurs with completely independent regexps:

    $x = "ab";
    $x =~ /a*/g;   # matches, eats an 'a'
    $x =~ /\Gab/g; # doesn't match, no 'a' available

Here C<//g> and C<\G> create a 'tag team' handoff of the string from
one regexp to the other. Regexps with an independent subexpression are
much like this, with a handoff of the string to the independent
subexpression, and a handoff of the string back to the enclosing
regexp.

The ability of an independent subexpression to prevent backtracking
can be quite useful. Suppose we want to match a non-empty string
enclosed in parentheses up to two levels deep. Then the following
regexp matches:

    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;

The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis.
The alternation is two-way, with
the first alternative C<[^()]+> matching a substring with no
parentheses and the second alternative C<\([^()]*\)> matching a
substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point. This can be done by
enclosing the inner quantifier as an independent subexpression:

    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it. Then
match failures fail much more quickly.

=head2 Conditional expressions

A S<B<conditional expression> > is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition. There are two types of conditional expression:
C<(?(condition)yes-regexp)> and
C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true,
the C<yes-regexp> will be matched. If the C<condition> is false, the
C<yes-regexp> will be skipped and perl will move onto the next regexp
element. The second form is like an S<C<'if () {} else {}'> > statement
in Perl. If the C<condition> is true, the C<yes-regexp> will be
matched, otherwise the C<no-regexp> will be matched.

The C<condition> can have two forms. The first form is simply an
integer in parentheses C<(integer)>. It is true if the corresponding
backreference C<\integer> matched earlier in the regexp.
The second
form is a bare zero width assertion C<(?...)>, either a
lookahead, a lookbehind, or a code assertion (discussed in the next
section).

The integer form of the C<condition> allows us to choose, with more
flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or
C<"$x$y$y$x">:

    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu

The lookbehind C<condition> allows, along with backreferences,
an earlier part of the match to influence a later part of the
match. For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<C>. Note that the form is
C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
lookahead, lookbehind or code assertions, the parentheses around the
conditional are not needed.

=head2 A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions.
S<B<Code evaluation> > expressions turn that around by allowing
arbitrary Perl code to be a part of a regexp. A code evaluation
expression is denoted C<(?{code})>, with C<code> a string of Perl
statements.

Code expressions are zero-width assertions, and the value they return
depends on their environment. There are two possibilities: either the
code expression is used as a conditional in a conditional expression
C<(?(condition)...)>, or it is not. If the code expression is a
conditional, the code is evaluated and the result (i.e., the result of
the last statement) is used to determine truth or falsehood. If the
code expression is not used as a conditional, the assertion always
evaluates true and the result is put into the special variable
C<$^R>.
The variable C<$^R> can then be used in code expressions later
in the regexp. Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'

Pay careful attention to the next example:

    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?

At first glance, you'd think that it shouldn't print, because obviously
the C<ddd> isn't going to match the target string. But look at this
example:

    $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
                                           # but _does_ print

Hmm. What happened here? If you've been following along, you know that
the above pattern should be effectively the same as the last one --
enclosing the d in a character class isn't going to change what it
matches. So why does the first not print while the second one does?

The answer lies in the optimizations the REx engine makes. In the first
case, all the engine sees are plain old characters (aside from the
C<?{}> construct). It's smart enough to realize that the string 'ddd'
doesn't occur in our target string before actually running the pattern
through. But in the second case, we've tricked it into thinking that our
pattern is more complicated than it is. It takes a look, sees our
character class, and decides that it will have to actually run the
pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't
have a match.

To take a closer look at how the engine does optimizations, see the
section L<"Pragmas and debugging"> below.

More fun with C<?{}>:

    $x =~ /(?{print "Hi Mom!";})/;         # matches,
                                           # prints 'Hi Mom!'
    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
                                           # prints '1'
    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
                                           # prints '1'

The bit of magic mentioned in the section title occurs when the regexp
backtracks in the process of searching for a match. If the regexp
backtracks over a code expression and if the variables used within are
localized using C<local>, the changes in the variables produced by the
code expression are undone! Thus, if we wanted to count how many times
a character got matched inside a group, we could use, e.g.,

    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";

This prints

    'a' count is 2, $c variable is 'bob'

If we replace the S<C< (?{local $c = $c + 1;})> > with
S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
during backtracking, and we get

    'a' count is 4, $c variable is 'bob'

Note that only localized variable changes are undone. Other side
effects of code expression execution are permanent. Thus

    $x = "aaaa";
    $x =~ /(a(?{print "Yow\n";}))*aa/;

produces

   Yow
   Yow
   Yow
   Yow

The result C<$^R> is automatically localized, so that it will behave
properly in the presence of backtracking.

This example uses a code expression in a conditional to match the
article 'the' in either English or German:

    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (die|das|der)     # else, match 'die|das|der'
                     )
                    /xi;

Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
code expression, we don't need the extra parentheses around the
conditional.

If you try to use code expressions with interpolating variables, perl
may surprise you:

    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
    /foo(?{ 1 })$bar/;   # compile error!
    /foo${pat}bar/;      # compile error!

    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
    /foo${pat}bar/;      # compiles ok

If a regexp has (1) code expressions and interpolating variables, or
(2) a variable that interpolates a code expression, perl treats the
regexp as an error. If the code expression is precompiled into a
variable, however, interpolating is ok. The question is, why is this
an error?

The reason is that variable interpolation and code expressions
together pose a security risk. The combination is dangerous because
many programmers who write search engines often take user input and
plug it directly into a regexp:

    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp

If the C<$regexp> variable contains a code expression, the user could
then execute arbitrary Perl code. For instance, some joker could
search for S<C<system('rm -rf *');> > to erase your files. In this
sense, the combination of interpolation and code expressions B<taints>
your regexp. So by default, using both interpolation and code
expressions in the same regexp is not allowed.
If you're not
concerned about malicious users, it is possible to bypass this
security check by invoking S<C<use re 'eval'> >:

    use re 'eval';       # throw caution out the door
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ 1 })$bar/;   # compiles ok
    /foo${pat}bar/;      # compiles ok

Another form of code expression is the S<B<pattern code expression> >.
The pattern code expression is like a regular code expression, except
that the result of the code evaluation is treated as a regular
expression and matched immediately. A simple example is

    $length = 5;
    $char = 'a';
    $x = 'aaaaabb';
    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'

This final example contains both ordinary and pattern code
expressions. It detects if a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:

    $s0 = 0; $s1 = 1; # initial conditions
    $x = "1101010010001000001";
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
                    (
                       (??{'0' x $s0}) # match $s0 of '0'
                       1               # and then a '1'
                       (?{
                          $largest = $s0;   # largest seq so far
                          $s2 = $s1 + $s0;  # compute next term
                          $s0 = $s1;        # in Fibonacci sequence
                          $s1 = $s2;
                         })
                    )+   # repeat as needed
                  $      # that is all there is
                 /x;
    print "Largest sequence matched was $largest\n";

This prints

    It is a Fibonacci sequence
    Largest sequence matched was 5

Ha! Try that with your garden variety regexp package...

Note that the variables C<$s0> and C<$s1> are not substituted when the
regexp is compiled, as happens for ordinary variables outside a code
expression. Rather, the code expressions are evaluated when perl
encounters them during the search for a match.
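
The match-time evaluation can be seen directly by reusing one regexp
while changing a variable between matches. The following is a minimal
sketch, assuming a reasonably recent perl; the helper C<leading_run>
is a hypothetical name introduced only for this illustration:

```perl
use strict;
use warnings;

# Package variables, read by the pattern code expression at match time.
our ($char, $length) = ('a', 5);

# Return the leading run of $length copies of $char, or undef.
# (leading_run is an illustrative helper, not a library function.)
sub leading_run {
    my ($text) = @_;
    return $text =~ /^((??{ $char x $length }))/ ? $1 : undef;
}

my $five = leading_run('aaaaabb');  # inner pattern evaluates to 'aaaaa'
$length = 2;
my $two  = leading_run('aaaaabb');  # same regexp, inner pattern now 'aa'
$length = 6;
my $six  = leading_run('aaaaabb');  # six 'a's are not there, so undef

printf "%s %s %s\n", $five, $two, defined $six ? $six : 'undef';
```

The regexp text never changes, yet each call matches a different
amount of the string, because C<$char x $length> is computed afresh
whenever the engine reaches the C<(??{...})>.
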

The regexp without the C<//x> modifier is

    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;

and is a great start on an Obfuscated Perl entry :-) When working with
code and conditional expressions, the extended form of regexps is
almost necessary in creating and debugging regexps.

=head2 Pragmas and debugging

Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl. We have already encountered one pragma in
the previous section, S<C<use re 'eval';> >, that allows variable
interpolation and code expressions to coexist in a regexp. The other
pragmas are

    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/);  # @parts is now tainted

The C<taint> pragma causes any substrings from a match with a tainted
variable to be tainted as well. This is not normally the case, as
regexps are often used to extract the safe bits from a tainted
variable. Use C<taint> when you are not extracting safe bits, but are
performing some other processing. Both C<taint> and C<eval> pragmas
are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.

    use re 'debug';
    /^(.*)$/s;       # output debugging info

    use re 'debugcolor';
    /^(.*)$/s;       # output debugging info in living color

The global C<debug> and C<debugcolor> pragmas allow one to get
detailed debugging info about regexp compilation and
execution. C<debugcolor> is the same as debug, except the debugging
information is displayed in color on terminals that can display
termcap color sequences.
Here is example output:

    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
    Compiling REx `a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)
    floating `bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx `a*b+c' against `abc'...
    Found floating substr `bc' at offset 1...
    Guessed: match at offset 0
    Matching REx `a*b+c' against `abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: `a*b+c'

If you have gotten this far into the tutorial, you can probably guess
what the different parts of the debugging output tell you. The first
part

    Compiling REx `a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)

describes the compilation stage. C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, goto line 4,
i.e., C<PLUS(7)>. The middle lines describe some heuristics and
optimizations performed before a match:

    floating `bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx `a*b+c' against `abc'...
    Found floating substr `bc' at offset 1...
    Guessed: match at offset 0

Then the match is executed and the remaining lines describe the
process:

    Matching REx `a*b+c' against `abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: `a*b+c'

Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched. The S<C<< |  1:  STAR >> > says that perl is at line number 1
in the compilation list above. See
L<perldebguts/"Debugging regular expressions"> for much more detail.

An alternative method of debugging regexps is to embed C<print>
statements within the regexp. This provides a blow-by-blow account of
the backtracking in an alternation:

    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                     t(?{print "t1\n";})
                     h(?{print "h1\n";})
                     i(?{print "i1\n";})
                     s(?{print "s1\n";})
                         |
                     t(?{print "t2\n";})
                     h(?{print "h2\n";})
                     a(?{print "a2\n";})
                     t(?{print "t2\n";})
                     (?{print "Done at position ", pos, "\n";})
                    @x;

prints

    Start at position 0
    t1
    h1
    t2
    h2
    a2
    t2
    Done at position 4

=head1 BUGS

Code expressions, conditional expressions, and independent expressions
are B<experimental>. Don't use them in production code. Yet.

=head1 SEE ALSO

This is just a tutorial. For the full story on perl regular
expressions, see the L<perlre> regular expressions reference page.

For more information on the matching C<m//> and substitution C<s///>
operators, see L<perlop/"Regexp Quote-Like Operators">. For
information on the C<split> operation, see L<perlfunc/split>.

For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.

The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.

=cut