1=head1 NAME 2 3perlretut - Perl regular expressions tutorial 4 5=head1 DESCRIPTION 6 7This page provides a basic tutorial on understanding, creating and 8using regular expressions in Perl. It serves as a complement to the 9reference page on regular expressions L<perlre>. Regular expressions 10are an integral part of the C<m//>, C<s///>, C<qr//> and C<split> 11operators and so this tutorial also overlaps with 12L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>. 13 14Perl is widely renowned for excellence in text processing, and regular 15expressions are one of the big factors behind this fame. Perl regular 16expressions display an efficiency and flexibility unknown in most 17other computer languages. Mastering even the basics of regular 18expressions will allow you to manipulate text with surprising ease. 19 20What is a regular expression? At its most basic, a regular expression 21is a template that is used to determine if a string has certain 22characteristics. The string is most often some text, such as a line, 23sentence, web page, or even a whole book, but less commonly it could be 24some binary data as well. 25Suppose we want to determine if the text in variable, C<$var> contains 26the sequence of characters S<C<m u s h r o o m>> 27(blanks added for legibility). We can write in Perl 28 29 $var =~ m/mushroom/ 30 31The value of this expression will be TRUE if C<$var> contains that 32sequence of characters, and FALSE otherwise. The portion enclosed in 33C<'E<sol>'> characters denotes the characteristic we are looking for. 34We use the term I<pattern> for it. The process of looking to see if the 35pattern occurs in the string is called I<matching>, and the C<"=~"> 36operator along with the C<m//> tell Perl to try to match the pattern 37against the string. Note that the pattern is also a string, but a very 38special kind of one, as we will see. Patterns are in common use these 39days; 40examples are the patterns typed into a search engine to find web pages 41and the patterns used to list files in a directory, I<e.g.>, "C<ls *.txt>" 42or "C<dir *.*>". In Perl, the patterns described by regular expressions 43are used not only to search strings, but to also extract desired parts 44of strings, and to do search and replace operations. 45 46Regular expressions have the undeserved reputation of being abstract 47and difficult to understand. This really stems simply because the 48notation used to express them tends to be terse and dense, and not 49because of inherent complexity. We recommend using the C</x> regular 50expression modifier (described below) along with plenty of white space 51to make them less dense, and easier to read. Regular expressions are 52constructed using 53simple concepts like conditionals and loops and are no more difficult 54to understand than the corresponding C<if> conditionals and C<while> 55loops in the Perl language itself. 56 57This tutorial flattens the learning curve by discussing regular 58expression concepts, along with their notation, one at a time and with 59many examples. The first part of the tutorial will progress from the 60simplest word searches to the basic regular expression concepts. If 61you master the first part, you will have all the tools needed to solve 62about 98% of your needs. The second part of the tutorial is for those 63comfortable with the basics and hungry for more power tools. It 64discusses the more advanced regular expression operators and 65introduces the latest cutting-edge innovations. 66 67A note: to save time, "regular expression" is often abbreviated as 68regexp or regex. Regexp is a more natural abbreviation than regex, but 69is harder to pronounce. The Perl pod documentation is evenly split on 70regexp vs regex; in Perl, there is more than one way to abbreviate it. 71We'll use regexp in this tutorial. 72 73New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter 74rules than otherwise when compiling regular expression patterns. It can 75find things that, while legal, may not be what you intended. 76 77=head1 Part 1: The basics 78 79=head2 Simple word matching 80 81The simplest regexp is simply a word, or more generally, a string of 82characters. A regexp consisting of just a word matches any string that 83contains that word: 84 85 "Hello World" =~ /World/; # matches 86 87What is this Perl statement all about? C<"Hello World"> is a simple 88double-quoted string. C<World> is the regular expression and the 89C<//> enclosing C</World/> tells Perl to search a string for a match. 90The operator C<=~> associates the string with the regexp match and 91produces a true value if the regexp matched, or false if the regexp 92did not match. In our case, C<World> matches the second word in 93C<"Hello World">, so the expression is true. Expressions like this 94are useful in conditionals: 95 96 if ("Hello World" =~ /World/) { 97 print "It matches\n"; 98 } 99 else { 100 print "It doesn't match\n"; 101 } 102 103There are useful variations on this theme. The sense of the match can 104be reversed by using the C<!~> operator: 105 106 if ("Hello World" !~ /World/) { 107 print "It doesn't match\n"; 108 } 109 else { 110 print "It matches\n"; 111 } 112 113The literal string in the regexp can be replaced by a variable: 114 115 my $greeting = "World"; 116 if ("Hello World" =~ /$greeting/) { 117 print "It matches\n"; 118 } 119 else { 120 print "It doesn't match\n"; 121 } 122 123If you're matching against the special default variable C<$_>, the 124C<$_ =~> part can be omitted: 125 126 $_ = "Hello World"; 127 if (/World/) { 128 print "It matches\n"; 129 } 130 else { 131 print "It doesn't match\n"; 132 } 133 134And finally, the C<//> default delimiters for a match can be changed 135to arbitrary delimiters by putting an C<'m'> out front: 136 137 "Hello World" =~ m!World!; # matches, delimited by '!' 138 "Hello World" =~ m{World}; # matches, note the matching '{}' 139 "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', 140 # '/' becomes an ordinary char 141 142C</World/>, C<m!World!>, and C<m{World}> all represent the 143same thing. When, I<e.g.>, the quote (C<'"'>) is used as a delimiter, the forward 144slash C<'/'> becomes an ordinary character and can be used in this regexp 145without trouble. 146 147Let's consider how different regexps would match C<"Hello World">: 148 149 "Hello World" =~ /world/; # doesn't match 150 "Hello World" =~ /o W/; # matches 151 "Hello World" =~ /oW/; # doesn't match 152 "Hello World" =~ /World /; # doesn't match 153 154The first regexp C<world> doesn't match because regexps are 155case-sensitive. The second regexp matches because the substring 156S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space 157character C<' '> is treated like any other character in a regexp and is 158needed to match in this case. The lack of a space character is the 159reason the third regexp C<'oW'> doesn't match. The fourth regexp 160"C<World >" doesn't match because there is a space at the end of the 161regexp, but not at the end of the string. The lesson here is that 162regexps must match a part of the string I<exactly> in order for the 163statement to be true. 164 165If a regexp matches in more than one place in the string, Perl will 166always match at the earliest possible point in the string: 167 168 "Hello World" =~ /o/; # matches 'o' in 'Hello' 169 "That hat is red" =~ /hat/; # matches 'hat' in 'That' 170 171With respect to character matching, there are a few more points you 172need to know about. First of all, not all characters can be used "as 173is" in a match. Some characters, called I<metacharacters>, are 174generally reserved for use in regexp notation. The metacharacters are 175 176 {}[]()^$.|*+?-#\ 177 178This list is not as definitive as it may appear (or be claimed to be in 179other documentation). For example, C<"#"> is a metacharacter only when 180the C</x> pattern modifier (described below) is used, and both C<"}"> 181and C<"]"> are metacharacters only when paired with opening C<"{"> or 182C<"["> respectively; other gotchas apply. 183 184The significance of each of these will be explained 185in the rest of the tutorial, but for now, it is important only to know 186that a metacharacter can be matched as-is by putting a backslash before 187it: 188 189 "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter 190 "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + 191 "The interval is [0,1)." =~ /[0,1)./ # is a syntax error! 192 "The interval is [0,1)." =~ /\[0,1\)\./ # matches 193 "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/; # matches 194 195In the last regexp, the forward slash C<'/'> is also backslashed, 196because it is used to delimit the regexp. This can lead to LTS 197(leaning toothpick syndrome), however, and it is often more readable 198to change delimiters. 199 200 "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!; # easier to read 201 202The backslash character C<'\'> is a metacharacter itself and needs to 203be backslashed: 204 205 'C:\WIN32' =~ /C:\\WIN/; # matches 206 207In situations where it doesn't make sense for a particular metacharacter 208to mean what it normally does, it automatically loses its 209metacharacter-ness and becomes an ordinary character that is to be 210matched literally. For example, the C<'}'> is a metacharacter only when 211it is the mate of a C<'{'> metacharacter. Otherwise it is treated as a 212literal RIGHT CURLY BRACKET. This may lead to unexpected results. 213L<C<use re 'strict'>|re/'strict' mode> can catch some of these. 214 215In addition to the metacharacters, there are some ASCII characters 216which don't have printable character equivalents and are instead 217represented by I<escape sequences>. Common examples are C<\t> for a 218tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a 219bell (or alert). If your string is better thought of as a sequence of arbitrary 220bytes, the octal escape sequence, I<e.g.>, C<\033>, or hexadecimal escape 221sequence, I<e.g.>, C<\x1B> may be a more natural representation for your 222bytes. Here are some examples of escapes: 223 224 "1000\t2000" =~ m(0\t2) # matches 225 "1000\n2000" =~ /0\n20/ # matches 226 "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000" 227 "cat" =~ /\o{143}\x61\x74/ # matches in ASCII, but a weird way 228 # to spell cat 229 230If you've been around Perl a while, all this talk of escape sequences 231may seem familiar. Similar escape sequences are used in double-quoted 232strings and in fact the regexps in Perl are mostly treated as 233double-quoted strings. This means that variables can be used in 234regexps as well. Just like double-quoted strings, the values of the 235variables in the regexp will be substituted in before the regexp is 236evaluated for matching purposes. So we have: 237 238 $foo = 'house'; 239 'housecat' =~ /$foo/; # matches 240 'cathouse' =~ /cat$foo/; # matches 241 'housecat' =~ /${foo}cat/; # matches 242 243So far, so good. With the knowledge above you can already perform 244searches with just about any literal string regexp you can dream up. 245Here is a I<very simple> emulation of the Unix grep program: 246 247 % cat > simple_grep 248 #!/usr/bin/perl 249 $regexp = shift; 250 while (<>) { 251 print if /$regexp/; 252 } 253 ^D 254 255 % chmod +x simple_grep 256 257 % simple_grep abba /usr/dict/words 258 Babbage 259 cabbage 260 cabbages 261 sabbath 262 Sabbathize 263 Sabbathizes 264 sabbatical 265 scabbard 266 scabbards 267 268This program is easy to understand. C<#!/usr/bin/perl> is the standard 269way to invoke a perl program from the shell. 270S<C<$regexp = shift;>> saves the first command line argument as the 271regexp to be used, leaving the rest of the command line arguments to 272be treated as files. S<C<< while (<>) >>> loops over all the lines in 273all the files. For each line, S<C<print if /$regexp/;>> prints the 274line if the regexp matches the line. In this line, both C<print> and 275C</$regexp/> use the default variable C<$_> implicitly. 276 277With all of the regexps above, if the regexp matched anywhere in the 278string, it was considered a match. Sometimes, however, we'd like to 279specify I<where> in the string the regexp should try to match. To do 280this, we would use the I<anchor> metacharacters C<'^'> and C<'$'>. The 281anchor C<'^'> means match at the beginning of the string and the anchor 282C<'$'> means match at the end of the string, or before a newline at the 283end of the string. Here is how they are used: 284 285 "housekeeper" =~ /keeper/; # matches 286 "housekeeper" =~ /^keeper/; # doesn't match 287 "housekeeper" =~ /keeper$/; # matches 288 "housekeeper\n" =~ /keeper$/; # matches 289 290The second regexp doesn't match because C<'^'> constrains C<keeper> to 291match only at the beginning of the string, but C<"housekeeper"> has 292keeper starting in the middle. The third regexp does match, since the 293C<'$'> constrains C<keeper> to match only at the end of the string. 294 295When both C<'^'> and C<'$'> are used at the same time, the regexp has to 296match both the beginning and the end of the string, I<i.e.>, the regexp 297matches the whole string. Consider 298 299 "keeper" =~ /^keep$/; # doesn't match 300 "keeper" =~ /^keeper$/; # matches 301 "" =~ /^$/; # ^$ matches an empty string 302 303The first regexp doesn't match because the string has more to it than 304C<keep>. Since the second regexp is exactly the string, it 305matches. Using both C<'^'> and C<'$'> in a regexp forces the complete 306string to match, so it gives you complete control over which strings 307match and which don't. Suppose you are looking for a fellow named 308bert, off in a string by himself: 309 310 "dogbert" =~ /bert/; # matches, but not what you want 311 312 "dilbert" =~ /^bert/; # doesn't match, but .. 313 "bertram" =~ /^bert/; # matches, so still not good enough 314 315 "bertram" =~ /^bert$/; # doesn't match, good 316 "dilbert" =~ /^bert$/; # doesn't match, good 317 "bert" =~ /^bert$/; # matches, perfect 318 319Of course, in the case of a literal string, one could just as easily 320use the string comparison S<C<$string eq 'bert'>> and it would be 321more efficient. The C<^...$> regexp really becomes useful when we 322add in the more powerful regexp tools below. 323 324=head2 Using character classes 325 326Although one can already do quite a lot with the literal string 327regexps above, we've only scratched the surface of regular expression 328technology. In this and subsequent sections we will introduce regexp 329concepts (and associated metacharacter notations) that will allow a 330regexp to represent not just a single character sequence, but a I<whole 331class> of them. 332 333One such concept is that of a I<character class>. A character class 334allows a set of possible characters, rather than just a single 335character, to match at a particular point in a regexp. You can define 336your own custom character classes. These 337are denoted by brackets C<[...]>, with the set of characters 338to be possibly matched inside. Here are some examples: 339 340 /cat/; # matches 'cat' 341 /[bcr]at/; # matches 'bat, 'cat', or 'rat' 342 /item[0123456789]/; # matches 'item0' or ... or 'item9' 343 "abc" =~ /[cab]/; # matches 'a' 344 345In the last statement, even though C<'c'> is the first character in 346the class, C<'a'> matches because the first character position in the 347string is the earliest point at which the regexp can match. 348 349 /[yY][eE][sS]/; # match 'yes' in a case-insensitive way 350 # 'yes', 'Yes', 'YES', etc. 351 352This regexp displays a common task: perform a case-insensitive 353match. Perl provides a way of avoiding all those brackets by simply 354appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;> 355can be rewritten as C</yes/i;>. The C<'i'> stands for 356case-insensitive and is an example of a I<modifier> of the matching 357operation. We will meet other modifiers later in the tutorial. 358 359We saw in the section above that there were ordinary characters, which 360represented themselves, and special characters, which needed a 361backslash C<'\'> to represent themselves. The same is true in a 362character class, but the sets of ordinary and special characters 363inside a character class are different than those outside a character 364class. The special characters for a character class are C<-]\^$> (and 365the pattern delimiter, whatever it is). 366C<']'> is special because it denotes the end of a character class. C<'$'> is 367special because it denotes a scalar variable. C<'\'> is special because 368it is used in escape sequences, just like above. Here is how the 369special characters C<]$\> are handled: 370 371 /[\]c]def/; # matches ']def' or 'cdef' 372 $x = 'bcr'; 373 /[$x]at/; # matches 'bat', 'cat', or 'rat' 374 /[\$x]at/; # matches '$at' or 'xat' 375 /[\\$x]at/; # matches '\at', 'bat, 'cat', or 'rat' 376 377The last two are a little tricky. In C<[\$x]>, the backslash protects 378the dollar sign, so the character class has two members C<'$'> and C<'x'>. 379In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a 380variable and substituted in double quote fashion. 381 382The special character C<'-'> acts as a range operator within character 383classes, so that a contiguous set of characters can be written as a 384range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]> 385become the svelte C<[0-9]> and C<[a-z]>. Some examples are 386 387 /item[0-9]/; # matches 'item0' or ... or 'item9' 388 /[0-9bx-z]aa/; # matches '0aa', ..., '9aa', 389 # 'baa', 'xaa', 'yaa', or 'zaa' 390 /[0-9a-fA-F]/; # matches a hexadecimal digit 391 /[0-9a-zA-Z_]/; # matches a "word" character, 392 # like those in a Perl variable name 393 394If C<'-'> is the first or last character in a character class, it is 395treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are 396all equivalent. 397 398The special character C<'^'> in the first position of a character class 399denotes a I<negated character class>, which matches any character but 400those in the brackets. Both C<[...]> and C<[^...]> must match a 401character, or the match fails. Then 402 403 /[^a]at/; # doesn't match 'aat' or 'at', but matches 404 # all other 'bat', 'cat, '0at', '%at', etc. 405 /[^0-9]/; # matches a non-numeric character 406 /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary 407 408Now, even C<[0-9]> can be a bother to write multiple times, so in the 409interest of saving keystrokes and making regexps more readable, Perl 410has several abbreviations for common character classes, as shown below. 411Since the introduction of Unicode, unless the C</a> modifier is in 412effect, these character classes match more than just a few characters in 413the ASCII range. 414 415=over 4 416 417=item * 418 419C<\d> matches a digit, not just C<[0-9]> but also digits from non-roman scripts 420 421=item * 422 423C<\s> matches a whitespace character, the set C<[\ \t\r\n\f]> and others 424 425=item * 426 427C<\w> matches a word character (alphanumeric or C<'_'>), not just C<[0-9a-zA-Z_]> 428but also digits and characters from non-roman scripts 429 430=item * 431 432C<\D> is a negated C<\d>; it represents any other character than a digit, or C<[^\d]> 433 434=item * 435 436C<\S> is a negated C<\s>; it represents any non-whitespace character C<[^\s]> 437 438=item * 439 440C<\W> is a negated C<\w>; it represents any non-word character C<[^\w]> 441 442=item * 443 444The period C<'.'> matches any character but C<"\n"> (unless the modifier C</s> is 445in effect, as explained below). 446 447=item * 448 449C<\N>, like the period, matches any character but C<"\n">, but it does so 450regardless of whether the modifier C</s> is in effect. 451 452=back 453 454The C</a> modifier, available starting in Perl 5.14, is used to 455restrict the matches of C<\d>, C<\s>, and C<\w> to just those in the ASCII range. 456It is useful to keep your program from being needlessly exposed to full 457Unicode (and its accompanying security considerations) when all you want 458is to process English-like text. (The "a" may be doubled, C</aa>, to 459provide even more restrictions, preventing case-insensitive matching of 460ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign" 461would caselessly match a "k" or "K".) 462 463The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside 464of bracketed character classes. Here are some in use: 465 466 /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format 467 /[\d\s]/; # matches any digit or whitespace character 468 /\w\W\w/; # matches a word char, followed by a 469 # non-word char, followed by a word char 470 /..rt/; # matches any two chars, followed by 'rt' 471 /end\./; # matches 'end.' 472 /end[.]/; # same thing, matches 'end.' 473 474Because a period is a metacharacter, it needs to be escaped to match 475as an ordinary period. Because, for example, C<\d> and C<\w> are sets 476of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in 477fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as 478C<[\W]>. Think DeMorgan's laws. 479 480In actuality, the period and C<\d\s\w\D\S\W> abbreviations are 481themselves types of character classes, so the ones surrounded by 482brackets are just one type of character class. When we need to make a 483distinction, we refer to them as "bracketed character classes." 484 485An anchor useful in basic regexps is the I<word anchor> 486C<\b>. This matches a boundary between a word character and a non-word 487character C<\w\W> or C<\W\w>: 488 489 $x = "Housecat catenates house and cat"; 490 $x =~ /cat/; # matches cat in 'housecat' 491 $x =~ /\bcat/; # matches cat in 'catenates' 492 $x =~ /cat\b/; # matches cat in 'housecat' 493 $x =~ /\bcat\b/; # matches 'cat' at end of string 494 495Note in the last example, the end of the string is considered a word 496boundary. 497 498For natural language processing (so that, for example, apostrophes are 499included in words), use instead C<\b{wb}> 500 501 "don't" =~ / .+? \b{wb} /x; # matches the whole string 502 503You might wonder why C<'.'> matches everything but C<"\n"> - why not 504every character? The reason is that often one is matching against 505lines and would like to ignore the newline characters. For instance, 506while the string C<"\n"> represents one line, we would like to think 507of it as empty. Then 508 509 "" =~ /^$/; # matches 510 "\n" =~ /^$/; # matches, $ anchors before "\n" 511 512 "" =~ /./; # doesn't match; it needs a char 513 "" =~ /^.$/; # doesn't match; it needs a char 514 "\n" =~ /^.$/; # doesn't match; it needs a char other than "\n" 515 "a" =~ /^.$/; # matches 516 "a\n" =~ /^.$/; # matches, $ anchors before "\n" 517 518This behavior is convenient, because we usually want to ignore 519newlines when we count and match characters in a line. Sometimes, 520however, we want to keep track of newlines. We might even want C<'^'> 521and C<'$'> to anchor at the beginning and end of lines within the 522string, rather than just the beginning and end of the string. Perl 523allows us to choose between ignoring and paying attention to newlines 524by using the C</s> and C</m> modifiers. C</s> and C</m> stand for 525single line and multi-line and they determine whether a string is to 526be treated as one continuous string, or as a set of lines. The two 527modifiers affect two aspects of how the regexp is interpreted: 1) how 528the C<'.'> character class is defined, and 2) where the anchors C<'^'> 529and C<'$'> are able to match. Here are the four possible combinations: 530 531=over 4 532 533=item * 534 535no modifiers: Default behavior. C<'.'> matches any character 536except C<"\n">. C<'^'> matches only at the beginning of the string and 537C<'$'> matches only at the end or before a newline at the end. 538 539=item * 540 541s modifier (C</s>): Treat string as a single long line. C<'.'> matches 542any character, even C<"\n">. C<'^'> matches only at the beginning of 543the string and C<'$'> matches only at the end or before a newline at the 544end. 545 546=item * 547 548m modifier (C</m>): Treat string as a set of multiple lines. C<'.'> 549matches any character except C<"\n">. C<'^'> and C<'$'> are able to match 550at the start or end of I<any> line within the string. 551 552=item * 553 554both s and m modifiers (C</sm>): Treat string as a single long line, but 555detect multiple lines. C<'.'> matches any character, even 556C<"\n">. C<'^'> and C<'$'>, however, are able to match at the start or end 557of I<any> line within the string. 558 559=back 560 561Here are examples of C</s> and C</m> in action: 562 563 $x = "There once was a girl\nWho programmed in Perl\n"; 564 565 $x =~ /^Who/; # doesn't match, "Who" not at start of string 566 $x =~ /^Who/s; # doesn't match, "Who" not at start of string 567 $x =~ /^Who/m; # matches, "Who" at start of second line 568 $x =~ /^Who/sm; # matches, "Who" at start of second line 569 570 $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n" 571 $x =~ /girl.Who/s; # matches, "." matches "\n" 572 $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n" 573 $x =~ /girl.Who/sm; # matches, "." matches "\n" 574 575Most of the time, the default behavior is what is wanted, but C</s> and 576C</m> are occasionally very useful. If C</m> is being used, the start 577of the string can still be matched with C<\A> and the end of the string 578can still be matched with the anchors C<\Z> (matches both the end and 579the newline before, like C<'$'>), and C<\z> (matches only the end): 580 581 $x =~ /^Who/m; # matches, "Who" at start of second line 582 $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string 583 584 $x =~ /girl$/m; # matches, "girl" at end of first line 585 $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string 586 587 $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end 588 $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string 589 590We now know how to create choices among classes of characters in a 591regexp. What about choices among words or character strings? Such 592choices are described in the next section. 593 594=head2 Matching this or that 595 596Sometimes we would like our regexp to be able to match different 597possible words or character strings. This is accomplished by using 598the I<alternation> metacharacter C<'|'>. To match C<dog> or C<cat>, we 599form the regexp C<dog|cat>. As before, Perl will try to match the 600regexp at the earliest possible point in the string. At each 601character position, Perl will first try to match the first 602alternative, C<dog>. If C<dog> doesn't match, Perl will then try the 603next alternative, C<cat>. If C<cat> doesn't match either, then the 604match fails and Perl moves to the next position in the string. Some 605examples: 606 607 "cats and dogs" =~ /cat|dog|bird/; # matches "cat" 608 "cats and dogs" =~ /dog|cat|bird/; # matches "cat" 609 610Even though C<dog> is the first alternative in the second regexp, 611C<cat> is able to match earlier in the string. 612 613 "cats" =~ /c|ca|cat|cats/; # matches "c" 614 "cats" =~ /cats|cat|ca|c/; # matches "cats" 615 616Here, all the alternatives match at the first string position, so the 617first alternative is the one that matches. If some of the 618alternatives are truncations of the others, put the longest ones first 619to give them a chance to match. 620 621 "cab" =~ /a|b|c/ # matches "c" 622 # /a|b|c/ == /[abc]/ 623 624The last example points out that character classes are like 625alternations of characters. At a given character position, the first 626alternative that allows the regexp match to succeed will be the one 627that matches. 628 629=head2 Grouping things and hierarchical matching 630 631Alternation allows a regexp to choose among alternatives, but by 632itself it is unsatisfying. The reason is that each alternative is a whole 633regexp, but sometime we want alternatives for just part of a 634regexp. For instance, suppose we want to search for housecats or 635housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is 636inefficient because we had to type C<house> twice. It would be nice to 637have parts of the regexp be constant, like C<house>, and some 638parts have alternatives, like C<cat|keeper>. 639 640The I<grouping> metacharacters C<()> solve this problem. Grouping 641allows parts of a regexp to be treated as a single unit. Parts of a 642regexp are grouped by enclosing them in parentheses. Thus we could solve 643the C<housecat|housekeeper> by forming the regexp as 644C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match 645C<house> followed by either C<cat> or C<keeper>. Some more examples 646are 647 648 /(a|b)b/; # matches 'ab' or 'bb' 649 /(ac|b)b/; # matches 'acb' or 'bb' 650 /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere 651 /(a|[bc])d/; # matches 'ad', 'bd', or 'cd' 652 653 /house(cat|)/; # matches either 'housecat' or 'house' 654 /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or 655 # 'house'. Note groups can be nested. 656 657 /(19|20|)\d\d/; # match years 19xx, 20xx, or the Y2K problem, xx 658 "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', 659 # because '20\d\d' can't match 660 661Alternations behave the same way in groups as out of them: at a given 662string position, the leftmost alternative that allows the regexp to 663match is taken. So in the last example at the first string position, 664C<"20"> matches the second alternative, but there is nothing left over 665to match the next two digits C<\d\d>. So Perl moves on to the next 666alternative, which is the null alternative and that works, since 667C<"20"> is two digits. 668 669The process of trying one alternative, seeing if it matches, and 670moving on to the next alternative, while going back in the string 671from where the previous alternative was tried, if it doesn't, is called 672I<backtracking>. The term "backtracking" comes from the idea that 673matching a regexp is like a walk in the woods. Successfully matching 674a regexp is like arriving at a destination. There are many possible 675trailheads, one for each string position, and each one is tried in 676order, left to right. From each trailhead there may be many paths, 677some of which get you there, and some which are dead ends. When you 678walk along a trail and hit a dead end, you have to backtrack along the 679trail to an earlier point to try another trail. If you hit your 680destination, you stop immediately and forget about trying all the 681other trails. You are persistent, and only if you have tried all the 682trails from all the trailheads and not arrived at your destination, do 683you declare failure. To be concrete, here is a step-by-step analysis 684of what Perl does when it tries to match the regexp 685 686 "abcde" =~ /(abd|abc)(df|d|de)/; 687 688=over 4 689 690=item Z<>0. Start with the first letter in the string C<'a'>. 691 692E<nbsp> 693 694=item Z<>1. Try the first alternative in the first group C<'abd'>. 695 696E<nbsp> 697 698=item Z<>2. Match C<'a'> followed by C<'b'>. So far so good. 699 700E<nbsp> 701 702=item Z<>3. C<'d'> in the regexp doesn't match C<'c'> in the string - a 703dead end. So backtrack two characters and pick the second alternative 704in the first group C<'abc'>. 705 706E<nbsp> 707 708=item Z<>4. Match C<'a'> followed by C<'b'> followed by C<'c'>. We are on a roll 709and have satisfied the first group. Set C<$1> to C<'abc'>. 710 711E<nbsp> 712 713=item Z<>5 Move on to the second group and pick the first alternative C<'df'>. 714 715E<nbsp> 716 717=item Z<>6 Match the C<'d'>. 718 719E<nbsp> 720 721=item Z<>7. C<'f'> in the regexp doesn't match C<'e'> in the string, so a dead 722end. Backtrack one character and pick the second alternative in the 723second group C<'d'>. 724 725E<nbsp> 726 727=item Z<>8. C<'d'> matches. The second grouping is satisfied, so set 728C<$2> to C<'d'>. 729 730E<nbsp> 731 732=item Z<>9. We are at the end of the regexp, so we are done! We have 733matched C<'abcd'> out of the string C<"abcde">. 734 735=back 736 737There are a couple of things to note about this analysis. First, the 738third alternative in the second group C<'de'> also allows a match, but we 739stopped before we got to it - at a given character position, leftmost 740wins. Second, we were able to get a match at the first character 741position of the string C<'a'>. If there were no matches at the first 742position, Perl would move to the second character position C<'b'> and 743attempt the match all over again. Only when all possible paths at all 744possible character positions have been exhausted does Perl give 745up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false. 746 747Even with all this work, regexp matching happens remarkably fast. To 748speed things up, Perl compiles the regexp into a compact sequence of 749opcodes that can often fit inside a processor cache. When the code is 750executed, these opcodes can then run at full throttle and search very 751quickly. 752 753=head2 Extracting matches 754 755The grouping metacharacters C<()> also serve another completely 756different function: they allow the extraction of the parts of a string 757that matched. This is very useful to find out what matched and for 758text processing in general. For each grouping, the part that matched 759inside goes into the special variables C<$1>, C<$2>, I<etc>. They can be 760used just as ordinary variables: 761 762 # extract hours, minutes, seconds 763 if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format 764 $hours = $1; 765 $minutes = $2; 766 $seconds = $3; 767 } 768 769Now, we know that in scalar context, 770S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false 771value. In list context, however, it returns the list of matched values 772C<($1,$2,$3)>. So we could write the code more compactly as 773 774 # extract hours, minutes, seconds 775 ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/); 776 777If the groupings in a regexp are nested, C<$1> gets the group with the 778leftmost opening parenthesis, C<$2> the next opening parenthesis, 779I<etc>. Here is a regexp with nested groups: 780 781 /(ab(cd|ef)((gi)|j))/; 782 1 2 34 783 784If this regexp matches, C<$1> contains a string starting with 785C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either 786C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>, 787or it remains undefined. 788 789For convenience, Perl sets C<$+> to the string held by the highest numbered 790C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the 791value of the C<$1>, C<$2>,... most-recently assigned; I<i.e.> the C<$1>, 792C<$2>,... associated with the rightmost closing parenthesis used in the 793match). 794 795 796=head2 Backreferences 797 798Closely associated with the matching variables C<$1>, C<$2>, ... are 799the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply 800matching variables that can be used I<inside> a regexp. This is a 801really nice feature; what matches later in a regexp is made to depend on 802what matched earlier in the regexp. Suppose we wanted to look 803for doubled words in a text, like "the the". The following regexp finds 804all 3-letter doubles with a space in between: 805 806 /\b(\w\w\w)\s\g1\b/; 807 808The grouping assigns a value to C<\g1>, so that the same 3-letter sequence 809is used for both parts. 810 811A similar task is to find words consisting of two identical parts: 812 813 % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words 814 beriberi 815 booboo 816 coco 817 mama 818 murmur 819 papa 820 821The regexp has a single grouping which considers 4-letter 822combinations, then 3-letter combinations, I<etc>., and uses C<\g1> to look for 823a repeat. Although C<$1> and C<\g1> represent the same thing, care should be 824taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp 825and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing 826so may lead to surprising and unsatisfactory results. 827 828 829=head2 Relative backreferences 830 831Counting the opening parentheses to get the correct number for a 832backreference is error-prone as soon as there is more than one 833capturing group. A more convenient technique became available 834with Perl 5.10: relative backreferences. To refer to the immediately 835preceding capture group one now may write C<\g{-1}>, the next but 836last is available via C<\g{-2}>, and so on. 837 838Another good reason in addition to readability and maintainability 839for using relative backreferences is illustrated by the following example, 840where a simple pattern for matching peculiar strings is used: 841 842 $a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc. 843 844Now that we have this pattern stored as a handy string, we might feel 845tempted to use it as a part of some other pattern: 846 847 $line = "code=e99e"; 848 if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior! 849 print "$1 is valid\n"; 850 } else { 851 print "bad line: '$line'\n"; 852 } 853 854But this doesn't match, at least not the way one might expect. Only 855after inserting the interpolated C<$a99a> and looking at the resulting 856full text of the regexp is it obvious that the backreferences have 857backfired. The subexpression C<(\w+)> has snatched number 1 and 858demoted the groups in C<$a99a> by one rank. This can be avoided by 859using relative backreferences: 860 861 $a99a = '([a-z])(\d)\g{-1}\g{-2}'; # safe for being interpolated 862 863 864=head2 Named backreferences 865 866Perl 5.10 also introduced named capture groups and named backreferences. 867To attach a name to a capturing group, you write either 868C<< (?<name>...) >> or C<< (?'name'...) >>. The backreference may 869then be written as C<\g{name}>. It is permissible to attach the 870same name to more than one group, but then only the leftmost one of the 871eponymous set can be referenced. Outside of the pattern a named 872capture group is accessible through the C<%+> hash. 873 874Assuming that we have to match calendar dates which may be given in one 875of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write 876three suitable patterns where we use C<'d'>, C<'m'> and C<'y'> respectively as the 877names of the groups capturing the pertaining components of a date. The 878matching operation combines the three patterns as alternatives: 879 880 $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)'; 881 $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)'; 882 $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)'; 883 for my $d (qw(2006-10-21 15.01.2007 10/31/2005)) { 884 if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){ 885 print "day=$+{d} month=$+{m} year=$+{y}\n"; 886 } 887 } 888 889If any of the alternatives matches, the hash C<%+> is bound to contain the 890three key-value pairs. 891 892 893=head2 Alternative capture group numbering 894 895Yet another capturing group numbering technique (also as from Perl 5.10) 896deals with the problem of referring to groups within a set of alternatives. 897Consider a pattern for matching a time of the day, civil or military style: 898 899 if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){ 900 # process hour and minute 901 } 902 903Processing the results requires an additional if statement to determine 904whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would 905be easier if we could use group numbers 1 and 2 in second alternative as 906well, and this is exactly what the parenthesized construct C<(?|...)>, 907set around an alternative achieves. Here is an extended version of the 908previous pattern: 909 910 if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){ 911 print "hour=$1 minute=$2 zone=$3\n"; 912 } 913 914Within the alternative numbering group, group numbers start at the same 915position for each alternative. After the group, numbering continues 916with one higher than the maximum reached across all the alternatives. 917 918=head2 Position information 919 920In addition to what was matched, Perl also provides the 921positions of what was matched as contents of the C<@-> and C<@+> 922arrays. C<$-[0]> is the position of the start of the entire match and 923C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the 924position of the start of the C<$n> match and C<$+[n]> is the position 925of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then 926this code 927 928 $x = "Mmm...donut, thought Homer"; 929 $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches 930 foreach $exp (1..$#-) { 931 print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n"; 932 } 933 934prints 935 936 Match 1: 'Mmm' at position (0,3) 937 Match 2: 'donut' at position (6,11) 938 939Even if there are no groupings in a regexp, it is still possible to 940find out what exactly matched in a string. If you use them, Perl 941will set C<$`> to the part of the string before the match, will set C<$&> 942to the part of the string that matched, and will set C<$'> to the part 943of the string after the match. An example: 944 945 $x = "the cat caught the mouse"; 946 $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse' 947 $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse' 948 949In the second match, C<$`> equals C<''> because the regexp matched at the 950first character position in the string and stopped; it never saw the 951second "the". 952 953If your code is to run on Perl versions earlier than 9545.20, it is worthwhile to note that using C<$`> and C<$'> 955slows down regexp matching quite a bit, while C<$&> slows it down to a 956lesser extent, because if they are used in one regexp in a program, 957they are generated for I<all> regexps in the program. So if raw 958performance is a goal of your application, they should be avoided. 959If you need to extract the corresponding substrings, use C<@-> and 960C<@+> instead: 961 962 $` is the same as substr( $x, 0, $-[0] ) 963 $& is the same as substr( $x, $-[0], $+[0]-$-[0] ) 964 $' is the same as substr( $x, $+[0] ) 965 966As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> 967variables may be used. These are only set if the C</p> modifier is 968present. Consequently they do not penalize the rest of the program. In 969Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available 970whether the C</p> has been used or not (the modifier is ignored), and 971C<$`>, C<$'> and C<$&> do not cause any speed difference. 972 973=head2 Non-capturing groupings 974 975A group that is required to bundle a set of alternatives may or may not be 976useful as a capturing group. If it isn't, it just creates a superfluous 977addition to the set of available capture group values, inside as well as 978outside the regexp. Non-capturing groupings, denoted by C<(?:regexp)>, 979still allow the regexp to be treated as a single unit, but don't establish 980a capturing group at the same time. Both capturing and non-capturing 981groupings are allowed to co-exist in the same regexp. Because there is 982no extraction, non-capturing groupings are faster than capturing 983groupings. Non-capturing groupings are also handy for choosing exactly 984which parts of a regexp are to be extracted to matching variables: 985 986 # match a number, $1-$4 are set, but we only want $1 987 /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/; 988 989 # match a number faster , only $1 is set 990 /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/; 991 992 # match a number, get $1 = whole number, $2 = exponent 993 /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/; 994 995Non-capturing groupings are also useful for removing nuisance 996elements gathered from a split operation where parentheses are 997required for some reason: 998 999 $x = '12aba34ba5'; 1000 @num = split /(a|b)+/, $x; # @num = ('12','a','34','a','5') 1001 @num = split /(?:a|b)+/, $x; # @num = ('12','34','5') 1002 1003In Perl 5.22 and later, all groups within a regexp can be set to 1004non-capturing by using the new C</n> flag: 1005 1006 "hello" =~ /(hi|hello)/n; # $1 is not set! 1007 1008See L<perlre/"n"> for more information. 1009 1010=head2 Matching repetitions 1011 1012The examples in the previous section display an annoying weakness. We 1013were only matching 3-letter words, or chunks of words of 4 letters or 1014less. We'd like to be able to match words or, more generally, strings 1015of any length, without writing out tedious alternatives like 1016C<\w\w\w\w|\w\w\w|\w\w|\w>. 1017 1018This is exactly the problem the I<quantifier> metacharacters C<'?'>, 1019C<'*'>, C<'+'>, and C<{}> were created for. They allow us to delimit the 1020number of repeats for a portion of a regexp we consider to be a 1021match. Quantifiers are put immediately after the character, character 1022class, or grouping that we want to specify. They have the following 1023meanings: 1024 1025=over 4 1026 1027=item * 1028 1029C<a?> means: match C<'a'> 1 or 0 times 1030 1031=item * 1032 1033C<a*> means: match C<'a'> 0 or more times, I<i.e.>, any number of times 1034 1035=item * 1036 1037C<a+> means: match C<'a'> 1 or more times, I<i.e.>, at least once 1038 1039=item * 1040 1041C<a{n,m}> means: match at least C<n> times, but not more than C<m> 1042times. 1043 1044=item * 1045 1046C<a{n,}> means: match at least C<n> or more times 1047 1048=item * 1049 1050C<a{n}> means: match exactly C<n> times 1051 1052=back 1053 1054Here are some examples: 1055 1056 /[a-z]+\s+\d*/; # match a lowercase word, at least one space, and 1057 # any number of digits 1058 /(\w+)\s+\g1/; # match doubled words of arbitrary length 1059 /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes' 1060 $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more 1061 # than 4 digits 1062 $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates 1063 $year =~ /^\d{2}(\d{2})?$/; # same thing written differently. 1064 # However, this captures the last two 1065 # digits in $1 and the other does not. 1066 1067 % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier? 1068 beriberi 1069 booboo 1070 coco 1071 mama 1072 murmur 1073 papa 1074 1075For all of these quantifiers, Perl will try to match as much of the 1076string as possible, while still allowing the regexp to succeed. Thus 1077with C</a?.../>, Perl will first try to match the regexp with the C<'a'> 1078present; if that fails, Perl will try to match the regexp without the 1079C<'a'> present. For the quantifier C<'*'>, we get the following: 1080 1081 $x = "the cat in the hat"; 1082 $x =~ /^(.*)(cat)(.*)$/; # matches, 1083 # $1 = 'the ' 1084 # $2 = 'cat' 1085 # $3 = ' in the hat' 1086 1087Which is what we might expect, the match finds the only C<cat> in the 1088string and locks onto it. Consider, however, this regexp: 1089 1090 $x =~ /^(.*)(at)(.*)$/; # matches, 1091 # $1 = 'the cat in the h' 1092 # $2 = 'at' 1093 # $3 = '' (0 characters match) 1094 1095One might initially guess that Perl would find the C<at> in C<cat> and 1096stop there, but that wouldn't give the longest possible string to the 1097first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as 1098much of the string as possible while still having the regexp match. In 1099this example, that means having the C<at> sequence with the final C<at> 1100in the string. The other important principle illustrated here is that, 1101when there are two or more elements in a regexp, the I<leftmost> 1102quantifier, if there is one, gets to grab as much of the string as 1103possible, leaving the rest of the regexp to fight over scraps. Thus in 1104our example, the first quantifier C<.*> grabs most of the string, while 1105the second quantifier C<.*> gets the empty string. Quantifiers that 1106grab as much of the string as possible are called I<maximal match> or 1107I<greedy> quantifiers. 1108 1109When a regexp can match a string in several different ways, we can use 1110the principles above to predict which way the regexp will match: 1111 1112=over 4 1113 1114=item * 1115 1116Principle 0: Taken as a whole, any regexp will be matched at the 1117earliest possible position in the string. 1118 1119=item * 1120 1121Principle 1: In an alternation C<a|b|c...>, the leftmost alternative 1122that allows a match for the whole regexp will be the one used. 1123 1124=item * 1125 1126Principle 2: The maximal matching quantifiers C<'?'>, C<'*'>, C<'+'> and 1127C<{n,m}> will in general match as much of the string as possible while 1128still allowing the whole regexp to match. 1129 1130=item * 1131 1132Principle 3: If there are two or more elements in a regexp, the 1133leftmost greedy quantifier, if any, will match as much of the string 1134as possible while still allowing the whole regexp to match. The next 1135leftmost greedy quantifier, if any, will try to match as much of the 1136string remaining available to it as possible, while still allowing the 1137whole regexp to match. And so on, until all the regexp elements are 1138satisfied. 1139 1140=back 1141 1142As we have seen above, Principle 0 overrides the others. The regexp 1143will be matched as early as possible, with the other principles 1144determining how the regexp matches at that earliest character 1145position. 1146 1147Here is an example of these principles in action: 1148 1149 $x = "The programming republic of Perl"; 1150 $x =~ /^(.+)(e|r)(.*)$/; # matches, 1151 # $1 = 'The programming republic of Pe' 1152 # $2 = 'r' 1153 # $3 = 'l' 1154 1155This regexp matches at the earliest string position, C<'T'>. One 1156might think that C<'e'>, being leftmost in the alternation, would be 1157matched, but C<'r'> produces the longest string in the first quantifier. 1158 1159 $x =~ /(m{1,2})(.*)$/; # matches, 1160 # $1 = 'mm' 1161 # $2 = 'ing republic of Perl' 1162 1163Here, The earliest possible match is at the first C<'m'> in 1164C<programming>. C<m{1,2}> is the first quantifier, so it gets to match 1165a maximal C<mm>. 1166 1167 $x =~ /.*(m{1,2})(.*)$/; # matches, 1168 # $1 = 'm' 1169 # $2 = 'ing republic of Perl' 1170 1171Here, the regexp matches at the start of the string. The first 1172quantifier C<.*> grabs as much as possible, leaving just a single 1173C<'m'> for the second quantifier C<m{1,2}>. 1174 1175 $x =~ /(.?)(m{1,2})(.*)$/; # matches, 1176 # $1 = 'a' 1177 # $2 = 'mm' 1178 # $3 = 'ing republic of Perl' 1179 1180Here, C<.?> eats its maximal one character at the earliest possible 1181position in the string, C<'a'> in C<programming>, leaving C<m{1,2}> 1182the opportunity to match both C<'m'>'s. Finally, 1183 1184 "aXXXb" =~ /(X*)/; # matches with $1 = '' 1185 1186because it can match zero copies of C<'X'> at the beginning of the 1187string. If you definitely want to match at least one C<'X'>, use 1188C<X+>, not C<X*>. 1189 1190Sometimes greed is not good. At times, we would like quantifiers to 1191match a I<minimal> piece of string, rather than a maximal piece. For 1192this purpose, Larry Wall created the I<minimal match> or 1193I<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are 1194the usual quantifiers with a C<'?'> appended to them. They have the 1195following meanings: 1196 1197=over 4 1198 1199=item * 1200 1201C<a??> means: match C<'a'> 0 or 1 times. Try 0 first, then 1. 1202 1203=item * 1204 1205C<a*?> means: match C<'a'> 0 or more times, I<i.e.>, any number of times, 1206but as few times as possible 1207 1208=item * 1209 1210C<a+?> means: match C<'a'> 1 or more times, I<i.e.>, at least once, but 1211as few times as possible 1212 1213=item * 1214 1215C<a{n,m}?> means: match at least C<n> times, not more than C<m> 1216times, as few times as possible 1217 1218=item * 1219 1220C<a{n,}?> means: match at least C<n> times, but as few times as 1221possible 1222 1223=item * 1224 1225C<a{n}?> means: match exactly C<n> times. Because we match exactly 1226C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for 1227notational consistency. 1228 1229=back 1230 1231Let's look at the example above, but with minimal quantifiers: 1232 1233 $x = "The programming republic of Perl"; 1234 $x =~ /^(.+?)(e|r)(.*)$/; # matches, 1235 # $1 = 'Th' 1236 # $2 = 'e' 1237 # $3 = ' programming republic of Perl' 1238 1239The minimal string that will allow both the start of the string C<'^'> 1240and the alternation to match is C<Th>, with the alternation C<e|r> 1241matching C<'e'>. The second quantifier C<.*> is free to gobble up the 1242rest of the string. 1243 1244 $x =~ /(m{1,2}?)(.*?)$/; # matches, 1245 # $1 = 'm' 1246 # $2 = 'ming republic of Perl' 1247 1248The first string position that this regexp can match is at the first 1249C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?> 1250matches just one C<'m'>. Although the second quantifier C<.*?> would 1251prefer to match no characters, it is constrained by the end-of-string 1252anchor C<'$'> to match the rest of the string. 1253 1254 $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches, 1255 # $1 = 'The progra' 1256 # $2 = 'm' 1257 # $3 = 'ming republic of Perl' 1258 1259In this regexp, you might expect the first minimal quantifier C<.*?> 1260to match the empty string, because it is not constrained by a C<'^'> 1261anchor to match the beginning of the word. Principle 0 applies here, 1262however. Because it is possible for the whole regexp to match at the 1263start of the string, it I<will> match at the start of the string. Thus 1264the first quantifier has to match everything up to the first C<'m'>. The 1265second minimal quantifier matches just one C<'m'> and the third 1266quantifier matches the rest of the string. 1267 1268 $x =~ /(.??)(m{1,2})(.*)$/; # matches, 1269 # $1 = 'a' 1270 # $2 = 'mm' 1271 # $3 = 'ing republic of Perl' 1272 1273Just as in the previous regexp, the first quantifier C<.??> can match 1274earliest at position C<'a'>, so it does. The second quantifier is 1275greedy, so it matches C<mm>, and the third matches the rest of the 1276string. 1277 1278We can modify principle 3 above to take into account non-greedy 1279quantifiers: 1280 1281=over 4 1282 1283=item * 1284 1285Principle 3: If there are two or more elements in a regexp, the 1286leftmost greedy (non-greedy) quantifier, if any, will match as much 1287(little) of the string as possible while still allowing the whole 1288regexp to match. The next leftmost greedy (non-greedy) quantifier, if 1289any, will try to match as much (little) of the string remaining 1290available to it as possible, while still allowing the whole regexp to 1291match. And so on, until all the regexp elements are satisfied. 1292 1293=back 1294 1295Just like alternation, quantifiers are also susceptible to 1296backtracking. Here is a step-by-step analysis of the example 1297 1298 $x = "the cat in the hat"; 1299 $x =~ /^(.*)(at)(.*)$/; # matches, 1300 # $1 = 'the cat in the h' 1301 # $2 = 'at' 1302 # $3 = '' (0 matches) 1303 1304=over 4 1305 1306=item Z<>0. Start with the first letter in the string C<'t'>. 1307 1308E<nbsp> 1309 1310=item Z<>1. The first quantifier C<'.*'> starts out by matching the whole 1311string "C<the cat in the hat>". 1312 1313E<nbsp> 1314 1315=item Z<>2. C<'a'> in the regexp element C<'at'> doesn't match the end 1316of the string. Backtrack one character. 1317 1318E<nbsp> 1319 1320=item Z<>3. C<'a'> in the regexp element C<'at'> still doesn't match 1321the last letter of the string C<'t'>, so backtrack one more character. 1322 1323E<nbsp> 1324 1325=item Z<>4. Now we can match the C<'a'> and the C<'t'>. 1326 1327E<nbsp> 1328 1329=item Z<>5. Move on to the third element C<'.*'>. Since we are at the 1330end of the string and C<'.*'> can match 0 times, assign it the empty 1331string. 1332 1333E<nbsp> 1334 1335=item Z<>6. We are done! 1336 1337=back 1338 1339Most of the time, all this moving forward and backtracking happens 1340quickly and searching is fast. There are some pathological regexps, 1341however, whose execution time exponentially grows with the size of the 1342string. A typical structure that blows up in your face is of the form 1343 1344 /(a|b+)*/; 1345 1346The problem is the nested indeterminate quantifiers. There are many 1347different ways of partitioning a string of length n between the C<'+'> 1348and C<'*'>: one repetition with C<b+> of length n, two repetitions with 1349the first C<b+> length k and the second with length n-k, m repetitions 1350whose bits add up to length n, I<etc>. In fact there are an exponential 1351number of ways to partition a string as a function of its length. A 1352regexp may get lucky and match early in the process, but if there is 1353no match, Perl will try I<every> possibility before giving up. So be 1354careful with nested C<'*'>'s, C<{n,m}>'s, and C<'+'>'s. The book 1355I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful 1356discussion of this and other efficiency issues. 1357 1358 1359=head2 Possessive quantifiers 1360 1361Backtracking during the relentless search for a match may be a waste 1362of time, particularly when the match is bound to fail. Consider 1363the simple pattern 1364 1365 /^\w+\s+\w+$/; # a word, spaces, a word 1366 1367Whenever this is applied to a string which doesn't quite meet the 1368pattern's expectations such as S<C<"abc ">> or S<C<"abc def ">>, 1369the regexp engine will backtrack, approximately once for each character 1370in the string. But we know that there is no way around taking I<all> 1371of the initial word characters to match the first repetition, that I<all> 1372spaces must be eaten by the middle part, and the same goes for the second 1373word. 1374 1375With the introduction of the I<possessive quantifiers> in Perl 5.10, we 1376have a way of instructing the regexp engine not to backtrack, with the 1377usual quantifiers with a C<'+'> appended to them. This makes them greedy as 1378well as stingy; once they succeed they won't give anything back to permit 1379another solution. They have the following meanings: 1380 1381=over 4 1382 1383=item * 1384 1385C<a{n,m}+> means: match at least C<n> times, not more than C<m> times, 1386as many times as possible, and don't give anything up. C<a?+> is short 1387for C<a{0,1}+> 1388 1389=item * 1390 1391C<a{n,}+> means: match at least C<n> times, but as many times as possible, 1392and don't give anything up. C<a*+> is short for C<a{0,}+> and C<a++> is 1393short for C<a{1,}+>. 1394 1395=item * 1396 1397C<a{n}+> means: match exactly C<n> times. It is just there for 1398notational consistency. 1399 1400=back 1401 1402These possessive quantifiers represent a special case of a more general 1403concept, the I<independent subexpression>, see below. 1404 1405As an example where a possessive quantifier is suitable we consider 1406matching a quoted string, as it appears in several programming languages. 1407The backslash is used as an escape character that indicates that the 1408next character is to be taken literally, as another character for the 1409string. Therefore, after the opening quote, we expect a (possibly 1410empty) sequence of alternatives: either some character except an 1411unescaped quote or backslash or an escaped character. 1412 1413 /"(?:[^"\\]++|\\.)*+"/; 1414 1415 1416=head2 Building a regexp 1417 1418At this point, we have all the basic regexp concepts covered, so let's 1419give a more involved example of a regular expression. We will build a 1420regexp that matches numbers. 1421 1422The first task in building a regexp is to decide what we want to match 1423and what we want to exclude. In our case, we want to match both 1424integers and floating point numbers and we want to reject any string 1425that isn't a number. 1426 1427The next task is to break the problem down into smaller problems that 1428are easily converted into a regexp. 1429 1430The simplest case is integers. These consist of a sequence of digits, 1431with an optional sign in front. The digits we can represent with 1432C<\d+> and the sign can be matched with C<[+-]>. Thus the integer 1433regexp is 1434 1435 /[+-]?\d+/; # matches integers 1436 1437A floating point number potentially has a sign, an integral part, a 1438decimal point, a fractional part, and an exponent. One or more of these 1439parts is optional, so we need to check out the different 1440possibilities. Floating point numbers which are in proper form include 1441123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out 1442front is completely optional and can be matched by C<[+-]?>. We can 1443see that if there is no exponent, floating point numbers must have a 1444decimal point, otherwise they are integers. We might be tempted to 1445model these with C<\d*\.\d*>, but this would also match just a single 1446decimal point, which is not a number. So the three cases of floating 1447point number without exponent are 1448 1449 /[+-]?\d+\./; # 1., 321., etc. 1450 /[+-]?\.\d+/; # .1, .234, etc. 1451 /[+-]?\d+\.\d+/; # 1.0, 30.56, etc. 1452 1453These can be combined into a single regexp with a three-way alternation: 1454 1455 /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent 1456 1457In this alternation, it is important to put C<'\d+\.\d+'> before 1458C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that 1459and ignore the fractional part of the number. 1460 1461Now consider floating point numbers with exponents. The key 1462observation here is that I<both> integers and numbers with decimal 1463points are allowed in front of an exponent. Then exponents, like the 1464overall sign, are independent of whether we are matching numbers with 1465or without decimal points, and can be "decoupled" from the 1466mantissa. The overall form of the regexp now becomes clear: 1467 1468 /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/; 1469 1470The exponent is an C<'e'> or C<'E'>, followed by an integer. So the 1471exponent regexp is 1472 1473 /[eE][+-]?\d+/; # exponent 1474 1475Putting all the parts together, we get a regexp that matches numbers: 1476 1477 /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da! 1478 1479Long regexps like this may impress your friends, but can be hard to 1480decipher. In complex situations like this, the C</x> modifier for a 1481match is invaluable. It allows one to put nearly arbitrary whitespace 1482and comments into a regexp without affecting their meaning. Using it, 1483we can rewrite our "extended" regexp in the more pleasing form 1484 1485 /^ 1486 [+-]? # first, match an optional sign 1487 ( # then match integers or f.p. mantissas: 1488 \d+\.\d+ # mantissa of the form a.b 1489 |\d+\. # mantissa of the form a. 1490 |\.\d+ # mantissa of the form .b 1491 |\d+ # integer of the form a 1492 ) 1493 ( [eE] [+-]? \d+ )? # finally, optionally match an exponent 1494 $/x; 1495 1496If whitespace is mostly irrelevant, how does one include space 1497characters in an extended regexp? The answer is to backslash it 1498S<C<'\ '>> or put it in a character class S<C<[ ]>>. The same thing 1499goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows 1500a space between the sign and the mantissa or integer, and we could add 1501this to our regexp as follows: 1502 1503 /^ 1504 [+-]?\ * # first, match an optional sign *and space* 1505 ( # then match integers or f.p. mantissas: 1506 \d+\.\d+ # mantissa of the form a.b 1507 |\d+\. # mantissa of the form a. 1508 |\.\d+ # mantissa of the form .b 1509 |\d+ # integer of the form a 1510 ) 1511 ( [eE] [+-]? \d+ )? # finally, optionally match an exponent 1512 $/x; 1513 1514In this form, it is easier to see a way to simplify the 1515alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it 1516could be factored out: 1517 1518 /^ 1519 [+-]?\ * # first, match an optional sign 1520 ( # then match integers or f.p. mantissas: 1521 \d+ # start out with a ... 1522 ( 1523 \.\d* # mantissa of the form a.b or a. 1524 )? # ? takes care of integers of the form a 1525 |\.\d+ # mantissa of the form .b 1526 ) 1527 ( [eE] [+-]? \d+ )? # finally, optionally match an exponent 1528 $/x; 1529 1530Starting in Perl v5.26, specifying C</xx> changes the square-bracketed 1531portions of a pattern to ignore tabs and space characters unless they 1532are escaped by preceding them with a backslash. So, we could write 1533 1534 /^ 1535 [ + - ]?\ * # first, match an optional sign 1536 ( # then match integers or f.p. mantissas: 1537 \d+ # start out with a ... 1538 ( 1539 \.\d* # mantissa of the form a.b or a. 1540 )? # ? takes care of integers of the form a 1541 |\.\d+ # mantissa of the form .b 1542 ) 1543 ( [ e E ] [ + - ]? \d+ )? # finally, optionally match an exponent 1544 $/xx; 1545 1546This doesn't really improve the legibility of this example, but it's 1547available in case you want it. Squashing the pattern down to the 1548compact form, we have 1549 1550 /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/; 1551 1552This is our final regexp. To recap, we built a regexp by 1553 1554=over 4 1555 1556=item * 1557 1558specifying the task in detail, 1559 1560=item * 1561 1562breaking down the problem into smaller parts, 1563 1564=item * 1565 1566translating the small parts into regexps, 1567 1568=item * 1569 1570combining the regexps, 1571 1572=item * 1573 1574and optimizing the final combined regexp. 1575 1576=back 1577 1578These are also the typical steps involved in writing a computer 1579program. This makes perfect sense, because regular expressions are 1580essentially programs written in a little computer language that specifies 1581patterns. 1582 1583=head2 Using regular expressions in Perl 1584 1585The last topic of Part 1 briefly covers how regexps are used in Perl 1586programs. Where do they fit into Perl syntax? 1587 1588We have already introduced the matching operator in its default 1589C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used 1590the binding operator C<=~> and its negation C<!~> to test for string 1591matches. Associated with the matching operator, we have discussed the 1592single line C</s>, multi-line C</m>, case-insensitive C</i> and 1593extended C</x> modifiers. There are a few more things you might 1594want to know about matching operators. 1595 1596=head3 Prohibiting substitution 1597 1598If you change C<$pattern> after the first substitution happens, Perl 1599will ignore it. If you don't want any substitutions at all, use the 1600special delimiter C<m''>: 1601 1602 @pattern = ('Seuss'); 1603 while (<>) { 1604 print if m'@pattern'; # matches literal '@pattern', not 'Seuss' 1605 } 1606 1607Similar to strings, C<m''> acts like apostrophes on a regexp; all other 1608C<'m'> delimiters act like quotes. If the regexp evaluates to the empty string, 1609the regexp in the I<last successful match> is used instead. So we have 1610 1611 "dog" =~ /d/; # 'd' matches 1612 "dogbert" =~ //; # this matches the 'd' regexp used before 1613 1614 1615=head3 Global matching 1616 1617The final two modifiers we will discuss here, 1618C</g> and C</c>, concern multiple matches. 1619The modifier C</g> stands for global matching and allows the 1620matching operator to match within a string as many times as possible. 1621In scalar context, successive invocations against a string will have 1622C</g> jump from match to match, keeping track of position in the 1623string as it goes along. You can get or set the position with the 1624C<pos()> function. 1625 1626The use of C</g> is shown in the following example. Suppose we have 1627a string that consists of words separated by spaces. If we know how 1628many words there are in advance, we could extract the words using 1629groupings: 1630 1631 $x = "cat dog house"; # 3 words 1632 $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches, 1633 # $1 = 'cat' 1634 # $2 = 'dog' 1635 # $3 = 'house' 1636 1637But what if we had an indeterminate number of words? This is the sort 1638of task C</g> was made for. To extract all words, form the simple 1639regexp C<(\w+)> and loop over all matches with C</(\w+)/g>: 1640 1641 while ($x =~ /(\w+)/g) { 1642 print "Word is $1, ends at position ", pos $x, "\n"; 1643 } 1644 1645prints 1646 1647 Word is cat, ends at position 3 1648 Word is dog, ends at position 7 1649 Word is house, ends at position 13 1650 1651A failed match or changing the target string resets the position. If 1652you don't want the position reset after failure to match, add the 1653C</c>, as in C</regexp/gc>. The current position in the string is 1654associated with the string, not the regexp. This means that different 1655strings have different positions and their respective positions can be 1656set or read independently. 1657 1658In list context, C</g> returns a list of matched groupings, or if 1659there are no groupings, a list of matches to the whole regexp. So if 1660we wanted just the words, we could use 1661 1662 @words = ($x =~ /(\w+)/g); # matches, 1663 # $words[0] = 'cat' 1664 # $words[1] = 'dog' 1665 # $words[2] = 'house' 1666 1667Closely associated with the C</g> modifier is the C<\G> anchor. The 1668C<\G> anchor matches at the point where the previous C</g> match left 1669off. C<\G> allows us to easily do context-sensitive matching: 1670 1671 $metric = 1; # use metric units 1672 ... 1673 $x = <FILE>; # read in measurement 1674 $x =~ /^([+-]?\d+)\s*/g; # get magnitude 1675 $weight = $1; 1676 if ($metric) { # error checking 1677 print "Units error!" unless $x =~ /\Gkg\./g; 1678 } 1679 else { 1680 print "Units error!" unless $x =~ /\Glbs\./g; 1681 } 1682 $x =~ /\G\s+(widget|sprocket)/g; # continue processing 1683 1684The combination of C</g> and C<\G> allows us to process the string a 1685bit at a time and use arbitrary Perl logic to decide what to do next. 1686Currently, the C<\G> anchor is only fully supported when used to anchor 1687to the start of the pattern. 1688 1689C<\G> is also invaluable in processing fixed-length records with 1690regexps. Suppose we have a snippet of coding region DNA, encoded as 1691base pair letters C<ATCGTTGAAT...> and we want to find all the stop 1692codons C<TGA>. In a coding region, codons are 3-letter sequences, so 1693we can think of the DNA snippet as a sequence of 3-letter records. The 1694naive regexp 1695 1696 # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" 1697 $dna = "ATCGTTGAATGCAAATGACATGAC"; 1698 $dna =~ /TGA/; 1699 1700doesn't work; it may match a C<TGA>, but there is no guarantee that 1701the match is aligned with codon boundaries, I<e.g.>, the substring 1702S<C<GTT GAA>> gives a match. A better solution is 1703 1704 while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *? 1705 print "Got a TGA stop codon at position ", pos $dna, "\n"; 1706 } 1707 1708which prints 1709 1710 Got a TGA stop codon at position 18 1711 Got a TGA stop codon at position 23 1712 1713Position 18 is good, but position 23 is bogus. What happened? 1714 1715The answer is that our regexp works well until we get past the last 1716real match. Then the regexp will fail to match a synchronized C<TGA> 1717and start stepping ahead one character position at a time, not what we 1718want. The solution is to use C<\G> to anchor the match to the codon 1719alignment: 1720 1721 while ($dna =~ /\G(\w\w\w)*?TGA/g) { 1722 print "Got a TGA stop codon at position ", pos $dna, "\n"; 1723 } 1724 1725This prints 1726 1727 Got a TGA stop codon at position 18 1728 1729which is the correct answer. This example illustrates that it is 1730important not only to match what is desired, but to reject what is not 1731desired. 1732 1733(There are other regexp modifiers that are available, such as 1734C</o>, but their specialized uses are beyond the 1735scope of this introduction. ) 1736 1737=head3 Search and replace 1738 1739Regular expressions also play a big role in I<search and replace> 1740operations in Perl. Search and replace is accomplished with the 1741C<s///> operator. The general form is 1742C<s/regexp/replacement/modifiers>, with everything we know about 1743regexps and modifiers applying in this case as well. The 1744I<replacement> is a Perl double-quoted string that replaces in the 1745string whatever is matched with the C<regexp>. The operator C<=~> is 1746also used here to associate a string with C<s///>. If matching 1747against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match, 1748C<s///> returns the number of substitutions made; otherwise it returns 1749false. Here are a few examples: 1750 1751 $x = "Time to feed the cat!"; 1752 $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" 1753 if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { 1754 $more_insistent = 1; 1755 } 1756 $y = "'quoted words'"; 1757 $y =~ s/^'(.*)'$/$1/; # strip single quotes, 1758 # $y contains "quoted words" 1759 1760In the last example, the whole string was matched, but only the part 1761inside the single quotes was grouped. With the C<s///> operator, the 1762matched variables C<$1>, C<$2>, I<etc>. are immediately available for use 1763in the replacement expression, so we use C<$1> to replace the quoted 1764string with just what was quoted. With the global modifier, C<s///g> 1765will search and replace all occurrences of the regexp in the string: 1766 1767 $x = "I batted 4 for 4"; 1768 $x =~ s/4/four/; # doesn't do it all: 1769 # $x contains "I batted four for 4" 1770 $x = "I batted 4 for 4"; 1771 $x =~ s/4/four/g; # does it all: 1772 # $x contains "I batted four for four" 1773 1774If you prefer "regex" over "regexp" in this tutorial, you could use 1775the following program to replace it: 1776 1777 % cat > simple_replace 1778 #!/usr/bin/perl 1779 $regexp = shift; 1780 $replacement = shift; 1781 while (<>) { 1782 s/$regexp/$replacement/g; 1783 print; 1784 } 1785 ^D 1786 1787 % simple_replace regexp regex perlretut.pod 1788 1789In C<simple_replace> we used the C<s///g> modifier to replace all 1790occurrences of the regexp on each line. (Even though the regular 1791expression appears in a loop, Perl is smart enough to compile it 1792only once.) As with C<simple_grep>, both the 1793C<print> and the C<s/$regexp/$replacement/g> use C<$_> implicitly. 1794 1795If you don't want C<s///> to change your original variable you can use 1796the non-destructive substitute modifier, C<s///r>. This changes the 1797behavior so that C<s///r> returns the final substituted string 1798(instead of the number of substitutions): 1799 1800 $x = "I like dogs."; 1801 $y = $x =~ s/dogs/cats/r; 1802 print "$x $y\n"; 1803 1804That example will print "I like dogs. I like cats". Notice the original 1805C<$x> variable has not been affected. The overall 1806result of the substitution is instead stored in C<$y>. If the 1807substitution doesn't affect anything then the original string is 1808returned: 1809 1810 $x = "I like dogs."; 1811 $y = $x =~ s/elephants/cougars/r; 1812 print "$x $y\n"; # prints "I like dogs. I like dogs." 1813 1814One other interesting thing that the C<s///r> flag allows is chaining 1815substitutions: 1816 1817 $x = "Cats are great."; 1818 print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ 1819 s/Frogs/Hedgehogs/r, "\n"; 1820 # prints "Hedgehogs are great." 1821 1822A modifier available specifically to search and replace is the 1823C<s///e> evaluation modifier. C<s///e> treats the 1824replacement text as Perl code, rather than a double-quoted 1825string. The value that the code returns is substituted for the 1826matched substring. C<s///e> is useful if you need to do a bit of 1827computation in the process of replacing text. This example counts 1828character frequencies in a line: 1829 1830 $x = "Bill the cat"; 1831 $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself 1832 print "frequency of '$_' is $chars{$_}\n" 1833 foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); 1834 1835This prints 1836 1837 frequency of ' ' is 2 1838 frequency of 't' is 2 1839 frequency of 'l' is 2 1840 frequency of 'B' is 1 1841 frequency of 'c' is 1 1842 frequency of 'e' is 1 1843 frequency of 'h' is 1 1844 frequency of 'i' is 1 1845 frequency of 'a' is 1 1846 1847As with the match C<m//> operator, C<s///> can use other delimiters, 1848such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are 1849used C<s'''>, then the regexp and replacement are 1850treated as single-quoted strings and there are no 1851variable substitutions. C<s///> in list context 1852returns the same thing as in scalar context, I<i.e.>, the number of 1853matches. 1854 1855=head3 The split function 1856 1857The C<split()> function is another place where a regexp is used. 1858C<split /regexp/, string, limit> separates the C<string> operand into 1859a list of substrings and returns that list. The regexp must be designed 1860to match whatever constitutes the separators for the desired substrings. 1861The C<limit>, if present, constrains splitting into no more than C<limit> 1862number of strings. For example, to split a string into words, use 1863 1864 $x = "Calvin and Hobbes"; 1865 @words = split /\s+/, $x; # $word[0] = 'Calvin' 1866 # $word[1] = 'and' 1867 # $word[2] = 'Hobbes' 1868 1869If the empty regexp C<//> is used, the regexp always matches and 1870the string is split into individual characters. If the regexp has 1871groupings, then the resulting list contains the matched substrings from the 1872groupings as well. For instance, 1873 1874 $x = "/usr/bin/perl"; 1875 @dirs = split m!/!, $x; # $dirs[0] = '' 1876 # $dirs[1] = 'usr' 1877 # $dirs[2] = 'bin' 1878 # $dirs[3] = 'perl' 1879 @parts = split m!(/)!, $x; # $parts[0] = '' 1880 # $parts[1] = '/' 1881 # $parts[2] = 'usr' 1882 # $parts[3] = '/' 1883 # $parts[4] = 'bin' 1884 # $parts[5] = '/' 1885 # $parts[6] = 'perl' 1886 1887Since the first character of C<$x> matched the regexp, C<split> prepended 1888an empty initial element to the list. 1889 1890If you have read this far, congratulations! You now have all the basic 1891tools needed to use regular expressions to solve a wide range of text 1892processing problems. If this is your first time through the tutorial, 1893why not stop here and play around with regexps a while.... S<Part 2> 1894concerns the more esoteric aspects of regular expressions and those 1895concepts certainly aren't needed right at the start. 1896 1897=head1 Part 2: Power tools 1898 1899OK, you know the basics of regexps and you want to know more. If 1900matching regular expressions is analogous to a walk in the woods, then 1901the tools discussed in Part 1 are analogous to topo maps and a 1902compass, basic tools we use all the time. Most of the tools in part 2 1903are analogous to flare guns and satellite phones. They aren't used 1904too often on a hike, but when we are stuck, they can be invaluable. 1905 1906What follows are the more advanced, less used, or sometimes esoteric 1907capabilities of Perl regexps. In Part 2, we will assume you are 1908comfortable with the basics and concentrate on the advanced features. 1909 1910=head2 More on characters, strings, and character classes 1911 1912There are a number of escape sequences and character classes that we 1913haven't covered yet. 1914 1915There are several escape sequences that convert characters or strings 1916between upper and lower case, and they are also available within 1917patterns. C<\l> and C<\u> convert the next character to lower or 1918upper case, respectively: 1919 1920 $x = "perl"; 1921 $string =~ /\u$x/; # matches 'Perl' in $string 1922 $x = "M(rs?|s)\\."; # note the double backslash 1923 $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.', 1924 1925A C<\L> or C<\U> indicates a lasting conversion of case, until 1926terminated by C<\E> or thrown over by another C<\U> or C<\L>: 1927 1928 $x = "This word is in lower case:\L SHOUT\E"; 1929 $x =~ /shout/; # matches 1930 $x = "I STILL KEYPUNCH CARDS FOR MY 360"; 1931 $x =~ /\Ukeypunch/; # matches punch card string 1932 1933If there is no C<\E>, case is converted until the end of the 1934string. The regexps C<\L\u$word> or C<\u\L$word> convert the first 1935character of C<$word> to uppercase and the rest of the characters to 1936lowercase. 1937 1938Control characters can be escaped with C<\c>, so that a control-Z 1939character would be matched with C<\cZ>. The escape sequence 1940C<\Q>...C<\E> quotes, or protects most non-alphabetic characters. For 1941instance, 1942 1943 $x = "\QThat !^*&%~& cat!"; 1944 $x =~ /\Q!^*&%~&\E/; # check for rough language 1945 1946It does not protect C<'$'> or C<'@'>, so that variables can still be 1947substituted. 1948 1949C<\Q>, C<\L>, C<\l>, C<\U>, C<\u> and C<\E> are actually part of 1950double-quotish syntax, and not part of regexp syntax proper. They will 1951work if they appear in a regular expression embedded directly in a 1952program, but not when contained in a string that is interpolated in a 1953pattern. 1954 1955Perl regexps can handle more than just the 1956standard ASCII character set. Perl supports I<Unicode>, a standard 1957for representing the alphabets from virtually all of the world's written 1958languages, and a host of symbols. Perl's text strings are Unicode strings, so 1959they can contain characters with a value (codepoint or character number) higher 1960than 255. 1961 1962What does this mean for regexps? Well, regexp users don't need to know 1963much about Perl's internal representation of strings. But they do need 1964to know 1) how to represent Unicode characters in a regexp and 2) that 1965a matching operation will treat the string to be searched as a sequence 1966of characters, not bytes. The answer to 1) is that Unicode characters 1967greater than C<chr(255)> are represented using the C<\x{hex}> notation, because 1968C<\x>I<XY> (without curly braces and I<XY> are two hex digits) doesn't 1969go further than 255. (Starting in Perl 5.14, if you're an octal fan, 1970you can also use C<\o{oct}>.) 1971 1972 /\x{263a}/; # match a Unicode smiley face :) 1973 1974B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use 1975utf8> to use any Unicode features. This is no more the case: for 1976almost all Unicode processing, the explicit C<utf8> pragma is not 1977needed. (The only case where it matters is if your Perl script is in 1978Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.) 1979 1980Figuring out the hexadecimal sequence of a Unicode character you want 1981or deciphering someone else's hexadecimal Unicode regexp is about as 1982much fun as programming in machine code. So another way to specify 1983Unicode characters is to use the I<named character> escape 1984sequence C<\N{I<name>}>. I<name> is a name for the Unicode character, as 1985specified in the Unicode standard. For instance, if we wanted to 1986represent or match the astrological sign for the planet Mercury, we 1987could use 1988 1989 $x = "abc\N{MERCURY}def"; 1990 $x =~ /\N{MERCURY}/; # matches 1991 1992One can also use "short" names: 1993 1994 print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; 1995 print "\N{greek:Sigma} is an upper-case sigma.\n"; 1996 1997You can also restrict names to a certain alphabet by specifying the 1998L<charnames> pragma: 1999 2000 use charnames qw(greek); 2001 print "\N{sigma} is Greek sigma\n"; 2002 2003An index of character names is available on-line from the Unicode 2004Consortium, L<https://www.unicode.org/charts/charindex.html>; explanatory 2005material with links to other resources at 2006L<https://www.unicode.org/standard/where>. 2007 2008Starting in Perl v5.32, an alternative to C<\N{...}> for full names is 2009available, and that is to say 2010 2011 /\p{Name=greek small letter sigma}/ 2012 2013The casing of the character name is irrelevant when used in C<\p{}>, as 2014are most spaces, underscores and hyphens. (A few outlier characters 2015cause problems with ignoring all of them always. The details (which you 2016can look up when you get more proficient, and if ever needed) are in 2017L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>). 2018 2019The answer to requirement 2) is that a regexp (mostly) 2020uses Unicode characters. The "mostly" is for messy backward 2021compatibility reasons, but starting in Perl 5.14, any regexp compiled in 2022the scope of a C<use feature 'unicode_strings'> (which is automatically 2023turned on within the scope of a C<use 5.012> or higher) will turn that 2024"mostly" into "always". If you want to handle Unicode properly, you 2025should ensure that C<'unicode_strings'> is turned on. 2026Internally, this is encoded to bytes using either UTF-8 or a native 8 2027bit encoding, depending on the history of the string, but conceptually 2028it is a sequence of characters, not bytes. See L<perlunitut> for a 2029tutorial about that. 2030 2031Let us now discuss Unicode character classes, most usually called 2032"character properties". These are represented by the C<\p{I<name>}> 2033escape sequence. The negation of this is C<\P{I<name>}>. For example, 2034to match lower and uppercase characters, 2035 2036 $x = "BOB"; 2037 $x =~ /^\p{IsUpper}/; # matches, uppercase char class 2038 $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase 2039 $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class 2040 $x =~ /^\P{IsLower}/; # matches, char class sans lowercase 2041 2042(The "C<Is>" is optional.) 2043 2044There are many, many Unicode character properties. For the full list 2045see L<perluniprops>. Most of them have synonyms with shorter names, 2046also listed there. Some synonyms are a single character. For these, 2047you can drop the braces. For instance, C<\pM> is the same thing as 2048C<\p{Mark}>, meaning things like accent marks. 2049 2050The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are 2051used to categorize every Unicode character into the language script it 2052is written in. (C<Script_Extensions> is an improved version of 2053C<Script>, which is retained for backward compatibility, and so you 2054should generally use C<Script_Extensions>.) 2055For example, 2056English, French, and a bunch of other European languages are written in 2057the Latin script. But there is also the Greek script, the Thai script, 2058the Katakana script, I<etc>. You can test whether a character is in a 2059particular script (based on C<Script_Extensions>) with, for example 2060C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in 2061the Balinese script, you would use C<\P{Balinese}>. 2062 2063What we have described so far is the single form of the C<\p{...}> character 2064classes. There is also a compound form which you may run into. These 2065look like C<\p{I<name>=I<value>}> or C<\p{I<name>:I<value>}> (the equals sign and colon 2066can be used interchangeably). These are more general than the single form, 2067and in fact most of the single forms are just Perl-defined shortcuts for common 2068compound forms. For example, the script examples in the previous paragraph 2069could be written equivalently as C<\p{Script_Extensions=Latin}>, C<\p{Script_Extensions:Greek}>, 2070C<\p{script_extensions=katakana}>, and C<\P{script_extensions=balinese}> (case is irrelevant 2071between the C<{}> braces). You may 2072never have to use the compound forms, but sometimes it is necessary, and their 2073use can make your code easier to understand. 2074 2075C<\X> is an abbreviation for a character class that comprises 2076a Unicode I<extended grapheme cluster>. This represents a "logical character": 2077what appears to be a single character, but may be represented internally by more 2078than one. As an example, using the Unicode full names, I<e.g.>, "S<A + COMBINING 2079RING>" is a grapheme cluster with base character "A" and combining character 2080"S<COMBINING RING>, which translates in Danish to "A" with the circle atop it, 2081as in the word E<Aring>ngstrom. 2082 2083For the full and latest information about Unicode see the latest 2084Unicode standard, or the Unicode Consortium's website L<https://www.unicode.org> 2085 2086As if all those classes weren't enough, Perl also defines POSIX-style 2087character classes. These have the form C<[:I<name>:]>, with I<name> the 2088name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>, 2089C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>, 2090C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl 2091extension to match C<\w>), and C<blank> (a GNU extension). The C</a> 2092modifier restricts these to matching just in the ASCII range; otherwise 2093they can match the same as their corresponding Perl Unicode classes: 2094C<[:upper:]> is the same as C<\p{IsUpper}>, I<etc>. (There are some 2095exceptions and gotchas with this; see L<perlrecharclass> for a full 2096discussion.) The C<[:digit:]>, C<[:word:]>, and 2097C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s> 2098character classes. To negate a POSIX class, put a C<'^'> in front of 2099the name, so that, I<e.g.>, C<[:^digit:]> corresponds to C<\D> and, under 2100Unicode, C<\P{IsDigit}>. The Unicode and POSIX character classes can 2101be used just like C<\d>, with the exception that POSIX character 2102classes can only be used inside of a character class: 2103 2104 /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit 2105 /^=item\s[[:digit:]]/; # match '=item', 2106 # followed by a space and a digit 2107 /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit 2108 /^=item\s\p{IsDigit}/; # match '=item', 2109 # followed by a space and a digit 2110 2111Whew! That is all the rest of the characters and character classes. 2112 2113=head2 Compiling and saving regular expressions 2114 2115In Part 1 we mentioned that Perl compiles a regexp into a compact 2116sequence of opcodes. Thus, a compiled regexp is a data structure 2117that can be stored once and used again and again. The regexp quote 2118C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a 2119regexp and transforms the result into a form that can be assigned to a 2120variable: 2121 2122 $reg = qr/foo+bar?/; # reg contains a compiled regexp 2123 2124Then C<$reg> can be used as a regexp: 2125 2126 $x = "fooooba"; 2127 $x =~ $reg; # matches, just like /foo+bar?/ 2128 $x =~ /$reg/; # same thing, alternate form 2129 2130C<$reg> can also be interpolated into a larger regexp: 2131 2132 $x =~ /(abc)?$reg/; # still matches 2133 2134As with the matching operator, the regexp quote can use different 2135delimiters, I<e.g.>, C<qr!!>, C<qr{}> or C<qr~~>. Apostrophes 2136as delimiters (C<qr''>) inhibit any interpolation. 2137 2138Pre-compiled regexps are useful for creating dynamic matches that 2139don't need to be recompiled each time they are encountered. Using 2140pre-compiled regexps, we write a C<grep_step> program which greps 2141for a sequence of patterns, advancing to the next pattern as soon 2142as one has been satisfied. 2143 2144 % cat > grep_step 2145 #!/usr/bin/perl 2146 # grep_step - match <number> regexps, one after the other 2147 # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ... 2148 2149 $number = shift; 2150 $regexp[$_] = shift foreach (0..$number-1); 2151 @compiled = map qr/$_/, @regexp; 2152 while ($line = <>) { 2153 if ($line =~ /$compiled[0]/) { 2154 print $line; 2155 shift @compiled; 2156 last unless @compiled; 2157 } 2158 } 2159 ^D 2160 2161 % grep_step 3 shift print last grep_step 2162 $number = shift; 2163 print $line; 2164 last unless @compiled; 2165 2166Storing pre-compiled regexps in an array C<@compiled> allows us to 2167simply loop through the regexps without any recompilation, thus gaining 2168flexibility without sacrificing speed. 2169 2170 2171=head2 Composing regular expressions at runtime 2172 2173Backtracking is more efficient than repeated tries with different regular 2174expressions. If there are several regular expressions and a match with 2175any of them is acceptable, then it is possible to combine them into a set 2176of alternatives. If the individual expressions are input data, this 2177can be done by programming a join operation. We'll exploit this idea in 2178an improved version of the C<simple_grep> program: a program that matches 2179multiple patterns: 2180 2181 % cat > multi_grep 2182 #!/usr/bin/perl 2183 # multi_grep - match any of <number> regexps 2184 # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ... 2185 2186 $number = shift; 2187 $regexp[$_] = shift foreach (0..$number-1); 2188 $pattern = join '|', @regexp; 2189 2190 while ($line = <>) { 2191 print $line if $line =~ /$pattern/; 2192 } 2193 ^D 2194 2195 % multi_grep 2 shift for multi_grep 2196 $number = shift; 2197 $regexp[$_] = shift foreach (0..$number-1); 2198 2199Sometimes it is advantageous to construct a pattern from the I<input> 2200that is to be analyzed and use the permissible values on the left 2201hand side of the matching operations. As an example for this somewhat 2202paradoxical situation, let's assume that our input contains a command 2203verb which should match one out of a set of available command verbs, 2204with the additional twist that commands may be abbreviated as long as 2205the given string is unique. The program below demonstrates the basic 2206algorithm. 2207 2208 % cat > keymatch 2209 #!/usr/bin/perl 2210 $kwds = 'copy compare list print'; 2211 while( $cmd = <> ){ 2212 $cmd =~ s/^\s+|\s+$//g; # trim leading and trailing spaces 2213 if( ( @matches = $kwds =~ /\b$cmd\w*/g ) == 1 ){ 2214 print "command: '@matches'\n"; 2215 } elsif( @matches == 0 ){ 2216 print "no such command: '$cmd'\n"; 2217 } else { 2218 print "not unique: '$cmd' (could be one of: @matches)\n"; 2219 } 2220 } 2221 ^D 2222 2223 % keymatch 2224 li 2225 command: 'list' 2226 co 2227 not unique: 'co' (could be one of: copy compare) 2228 printer 2229 no such command: 'printer' 2230 2231Rather than trying to match the input against the keywords, we match the 2232combined set of keywords against the input. The pattern matching 2233operation S<C<$kwds =~ /\b($cmd\w*)/g>> does several things at the 2234same time. It makes sure that the given command begins where a keyword 2235begins (C<\b>). It tolerates abbreviations due to the added C<\w*>. It 2236tells us the number of matches (C<scalar @matches>) and all the keywords 2237that were actually matched. You could hardly ask for more. 2238 2239=head2 Embedding comments and modifiers in a regular expression 2240 2241Starting with this section, we will be discussing Perl's set of 2242I<extended patterns>. These are extensions to the traditional regular 2243expression syntax that provide powerful new tools for pattern 2244matching. We have already seen extensions in the form of the minimal 2245matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. Most 2246of the extensions below have the form C<(?char...)>, where the 2247C<char> is a character that determines the type of extension. 2248 2249The first extension is an embedded comment C<(?#text)>. This embeds a 2250comment into the regular expression without affecting its meaning. The 2251comment should not have any closing parentheses in the text. An 2252example is 2253 2254 /(?# Match an integer:)[+-]?\d+/; 2255 2256This style of commenting has been largely superseded by the raw, 2257freeform commenting that is allowed with the C</x> modifier. 2258 2259Most modifiers, such as C</i>, C</m>, C</s> and C</x> (or any 2260combination thereof) can also be embedded in 2261a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance, 2262 2263 /(?i)yes/; # match 'yes' case insensitively 2264 /yes/i; # same thing 2265 /(?x)( # freeform version of an integer regexp 2266 [+-]? # match an optional sign 2267 \d+ # match a sequence of digits 2268 ) 2269 /x; 2270 2271Embedded modifiers can have two important advantages over the usual 2272modifiers. Embedded modifiers allow a custom set of modifiers for 2273I<each> regexp pattern. This is great for matching an array of regexps 2274that must have different modifiers: 2275 2276 $pattern[0] = '(?i)doctor'; 2277 $pattern[1] = 'Johnson'; 2278 ... 2279 while (<>) { 2280 foreach $patt (@pattern) { 2281 print if /$patt/; 2282 } 2283 } 2284 2285The second advantage is that embedded modifiers (except C</p>, which 2286modifies the entire regexp) only affect the regexp 2287inside the group the embedded modifier is contained in. So grouping 2288can be used to localize the modifier's effects: 2289 2290 /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc. 2291 2292Embedded modifiers can also turn off any modifiers already present 2293by using, I<e.g.>, C<(?-i)>. Modifiers can also be combined into 2294a single expression, I<e.g.>, C<(?s-i)> turns on single line mode and 2295turns off case insensitivity. 2296 2297Embedded modifiers may also be added to a non-capturing grouping. 2298C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp> 2299case insensitively and turns off multi-line mode. 2300 2301 2302=head2 Looking ahead and looking behind 2303 2304This section concerns the lookahead and lookbehind assertions. First, 2305a little background. 2306 2307In Perl regular expressions, most regexp elements "eat up" a certain 2308amount of string when they match. For instance, the regexp element 2309C<[abc]> eats up one character of the string when it matches, in the 2310sense that Perl moves to the next character position in the string 2311after the match. There are some elements, however, that don't eat up 2312characters (advance the character position) if they match. The examples 2313we have seen so far are the anchors. The anchor C<'^'> matches the 2314beginning of the line, but doesn't eat any characters. Similarly, the 2315word boundary anchor C<\b> matches wherever a character matching C<\w> 2316is next to a character that doesn't, but it doesn't eat up any 2317characters itself. Anchors are examples of I<zero-width assertions>: 2318zero-width, because they consume 2319no characters, and assertions, because they test some property of the 2320string. In the context of our walk in the woods analogy to regexp 2321matching, most regexp elements move us along a trail, but anchors have 2322us stop a moment and check our surroundings. If the local environment 2323checks out, we can proceed forward. But if the local environment 2324doesn't satisfy us, we must backtrack. 2325 2326Checking the environment entails either looking ahead on the trail, 2327looking behind, or both. C<'^'> looks behind, to see that there are no 2328characters before. C<'$'> looks ahead, to see that there are no 2329characters after. C<\b> looks both ahead and behind, to see if the 2330characters on either side differ in their "word-ness". 2331 2332The lookahead and lookbehind assertions are generalizations of the 2333anchor concept. Lookahead and lookbehind are zero-width assertions 2334that let us specify which characters we want to test for. The 2335lookahead assertion is denoted by C<(?=regexp)> or (starting in 5.32, 2336experimentally in 5.28) C<(*pla:regexp)> or 2337C<(*positive_lookahead:regexp)>; and the lookbehind assertion is denoted 2338by C<< (?<=fixed-regexp) >> or (starting in 5.32, experimentally in 23395.28) C<(*plb:fixed-regexp)> or C<(*positive_lookbehind:fixed-regexp)>. 2340Some examples are 2341 2342 $x = "I catch the housecat 'Tom-cat' with catnip"; 2343 $x =~ /cat(*pla:\s)/; # matches 'cat' in 'housecat' 2344 @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches, 2345 # $catwords[0] = 'catch' 2346 # $catwords[1] = 'catnip' 2347 $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat' 2348 $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in 2349 # middle of $x 2350 2351Note that the parentheses in these are 2352non-capturing, since these are zero-width assertions. Thus in the 2353second regexp, the substrings captured are those of the whole regexp 2354itself. Lookahead can match arbitrary regexps, but 2355lookbehind prior to 5.30 C<< (?<=fixed-regexp) >> only works for regexps 2356of fixed width, I<i.e.>, a fixed number of characters long. Thus 2357C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> prior to 5.30 is not. 2358 2359The negated versions of the lookahead and lookbehind assertions are 2360denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively. 2361Or, starting in 5.32 (experimentally in 5.28), C<(*nla:regexp)>, 2362C<(*negative_lookahead:regexp)>, C<(*nlb:regexp)>, or 2363C<(*negative_lookbehind:regexp)>. 2364They evaluate true if the regexps do I<not> match: 2365 2366 $x = "foobar"; 2367 $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo' 2368 $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo' 2369 $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo' 2370 2371Here is an example where a string containing blank-separated words, 2372numbers and single dashes is to be split into its components. 2373Using C</\s+/> alone won't work, because spaces are not required between 2374dashes, or a word or a dash. Additional places for a split are established 2375by looking ahead and behind: 2376 2377 $str = "one two - --6-8"; 2378 @toks = split / \s+ # a run of spaces 2379 | (?<=\S) (?=-) # any non-space followed by '-' 2380 | (?<=-) (?=\S) # a '-' followed by any non-space 2381 /x, $str; # @toks = qw(one two - - - 6 - 8) 2382 2383=head2 Using independent subexpressions to prevent backtracking 2384 2385I<Independent subexpressions> (or atomic subexpressions) are regular 2386expressions, in the context of a larger regular expression, that 2387function independently of the larger regular expression. That is, they 2388consume as much or as little of the string as they wish without regard 2389for the ability of the larger regexp to match. Independent 2390subexpressions are represented by 2391C<< (?>regexp) >> or (starting in 5.32, experimentally in 5.28) 2392C<(*atomic:regexp)>. We can illustrate their behavior by first 2393considering an ordinary regexp: 2394 2395 $x = "ab"; 2396 $x =~ /a*ab/; # matches 2397 2398This obviously matches, but in the process of matching, the 2399subexpression C<a*> first grabbed the C<'a'>. Doing so, however, 2400wouldn't allow the whole regexp to match, so after backtracking, C<a*> 2401eventually gave back the C<'a'> and matched the empty string. Here, what 2402C<a*> matched was I<dependent> on what the rest of the regexp matched. 2403 2404Contrast that with an independent subexpression: 2405 2406 $x =~ /(?>a*)ab/; # doesn't match! 2407 2408The independent subexpression C<< (?>a*) >> doesn't care about the rest 2409of the regexp, so it sees an C<'a'> and grabs it. Then the rest of the 2410regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there 2411is no backtracking and the independent subexpression does not give 2412up its C<'a'>. Thus the match of the regexp as a whole fails. A similar 2413behavior occurs with completely independent regexps: 2414 2415 $x = "ab"; 2416 $x =~ /a*/g; # matches, eats an 'a' 2417 $x =~ /\Gab/g; # doesn't match, no 'a' available 2418 2419Here C</g> and C<\G> create a "tag team" handoff of the string from 2420one regexp to the other. Regexps with an independent subexpression are 2421much like this, with a handoff of the string to the independent 2422subexpression, and a handoff of the string back to the enclosing 2423regexp. 2424 2425The ability of an independent subexpression to prevent backtracking 2426can be quite useful. Suppose we want to match a non-empty string 2427enclosed in parentheses up to two levels deep. Then the following 2428regexp matches: 2429 2430 $x = "abc(de(fg)h"; # unbalanced parentheses 2431 $x =~ /\( ( [ ^ () ]+ | \( [ ^ () ]* \) )+ \)/xx; 2432 2433The regexp matches an open parenthesis, one or more copies of an 2434alternation, and a close parenthesis. The alternation is two-way, with 2435the first alternative C<[^()]+> matching a substring with no 2436parentheses and the second alternative C<\([^()]*\)> matching a 2437substring delimited by parentheses. The problem with this regexp is 2438that it is pathological: it has nested indeterminate quantifiers 2439of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers 2440like this could take an exponentially long time to execute if there 2441was no match possible. To prevent the exponential blowup, we need to 2442prevent useless backtracking at some point. This can be done by 2443enclosing the inner quantifier as an independent subexpression: 2444 2445 $x =~ /\( ( (?> [ ^ () ]+ ) | \([ ^ () ]* \) )+ \)/xx; 2446 2447Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning 2448by gobbling up as much of the string as possible and keeping it. Then 2449match failures fail much more quickly. 2450 2451 2452=head2 Conditional expressions 2453 2454A I<conditional expression> is a form of if-then-else statement 2455that allows one to choose which patterns are to be matched, based on 2456some condition. There are two types of conditional expression: 2457C<(?(I<condition>)I<yes-regexp>)> and 2458C<(?(condition)I<yes-regexp>|I<no-regexp>)>. 2459C<(?(I<condition>)I<yes-regexp>)> is 2460like an S<C<'if () {}'>> statement in Perl. If the I<condition> is true, 2461the I<yes-regexp> will be matched. If the I<condition> is false, the 2462I<yes-regexp> will be skipped and Perl will move onto the next regexp 2463element. The second form is like an S<C<'if () {} else {}'>> statement 2464in Perl. If the I<condition> is true, the I<yes-regexp> will be 2465matched, otherwise the I<no-regexp> will be matched. 2466 2467The I<condition> can have several forms. The first form is simply an 2468integer in parentheses C<(I<integer>)>. It is true if the corresponding 2469backreference C<\I<integer>> matched earlier in the regexp. The same 2470thing can be done with a name associated with a capture group, written 2471as C<<< (E<lt>I<name>E<gt>) >>> or C<< ('I<name>') >>. The second form is a bare 2472zero-width assertion C<(?...)>, either a lookahead, a lookbehind, or a 2473code assertion (discussed in the next section). The third set of forms 2474provides tests that return true if the expression is executed within 2475a recursion (C<(R)>) or is being called from some capturing group, 2476referenced either by number (C<(R1)>, C<(R2)>,...) or by name 2477(C<(R&I<name>)>). 2478 2479The integer or name form of the C<condition> allows us to choose, 2480with more flexibility, what to match based on what matched earlier in the 2481regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">: 2482 2483 % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words 2484 beriberi 2485 coco 2486 couscous 2487 deed 2488 ... 2489 toot 2490 toto 2491 tutu 2492 2493The lookbehind C<condition> allows, along with backreferences, 2494an earlier part of the match to influence a later part of the 2495match. For instance, 2496 2497 /[ATGC]+(?(?<=AA)G|C)$/; 2498 2499matches a DNA sequence such that it either ends in C<AAG>, or some 2500other base pair combination and C<'C'>. Note that the form is 2501C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the 2502lookahead, lookbehind or code assertions, the parentheses around the 2503conditional are not needed. 2504 2505 2506=head2 Defining named patterns 2507 2508Some regular expressions use identical subpatterns in several places. 2509Starting with Perl 5.10, it is possible to define named subpatterns in 2510a section of the pattern so that they can be called up by name 2511anywhere in the pattern. This syntactic pattern for this definition 2512group is C<< (?(DEFINE)(?<I<name>>I<pattern>)...) >>. An insertion 2513of a named pattern is written as C<(?&I<name>)>. 2514 2515The example below illustrates this feature using the pattern for 2516floating point numbers that was presented earlier on. The three 2517subpatterns that are used more than once are the optional sign, the 2518digit sequence for an integer and the decimal fraction. The C<DEFINE> 2519group at the end of the pattern contains their definition. Notice 2520that the decimal fraction pattern is the first place where we can 2521reuse the integer pattern. 2522 2523 /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) ) 2524 (?: [eE](?&osg)(?&int) )? 2525 $ 2526 (?(DEFINE) 2527 (?<osg>[-+]?) # optional sign 2528 (?<int>\d++) # integer 2529 (?<dec>\.(?&int)) # decimal fraction 2530 )/x 2531 2532 2533=head2 Recursive patterns 2534 2535This feature (introduced in Perl 5.10) significantly extends the 2536power of Perl's pattern matching. By referring to some other 2537capture group anywhere in the pattern with the construct 2538C<(?I<group-ref>)>, the I<pattern> within the referenced group is used 2539as an independent subpattern in place of the group reference itself. 2540Because the group reference may be contained I<within> the group it 2541refers to, it is now possible to apply pattern matching to tasks that 2542hitherto required a recursive parser. 2543 2544To illustrate this feature, we'll design a pattern that matches if 2545a string contains a palindrome. (This is a word or a sentence that, 2546while ignoring spaces, interpunctuation and case, reads the same backwards 2547as forwards. We begin by observing that the empty string or a string 2548containing just one word character is a palindrome. Otherwise it must 2549have a word character up front and the same at its end, with another 2550palindrome in between. 2551 2552 /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x 2553 2554Adding C<\W*> at either end to eliminate what is to be ignored, we already 2555have the full pattern: 2556 2557 my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix; 2558 for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){ 2559 print "'$s' is a palindrome\n" if $s =~ /$pp/; 2560 } 2561 2562In C<(?...)> both absolute and relative backreferences may be used. 2563The entire pattern can be reinserted with C<(?R)> or C<(?0)>. 2564If you prefer to name your groups, you can use C<(?&I<name>)> to 2565recurse into that group. 2566 2567 2568=head2 A bit of magic: executing Perl code in a regular expression 2569 2570Normally, regexps are a part of Perl expressions. 2571I<Code evaluation> expressions turn that around by allowing 2572arbitrary Perl code to be a part of a regexp. A code evaluation 2573expression is denoted C<(?{I<code>})>, with I<code> a string of Perl 2574statements. 2575 2576Code expressions are zero-width assertions, and the value they return 2577depends on their environment. There are two possibilities: either the 2578code expression is used as a conditional in a conditional expression 2579C<(?(I<condition>)...)>, or it is not. If the code expression is a 2580conditional, the code is evaluated and the result (I<i.e.>, the result of 2581the last statement) is used to determine truth or falsehood. If the 2582code expression is not used as a conditional, the assertion always 2583evaluates true and the result is put into the special variable 2584C<$^R>. The variable C<$^R> can then be used in code expressions later 2585in the regexp. Here are some silly examples: 2586 2587 $x = "abcdef"; 2588 $x =~ /abc(?{print "Hi Mom!";})def/; # matches, 2589 # prints 'Hi Mom!' 2590 $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match, 2591 # no 'Hi Mom!' 2592 2593Pay careful attention to the next example: 2594 2595 $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match, 2596 # no 'Hi Mom!' 2597 # but why not? 2598 2599At first glance, you'd think that it shouldn't print, because obviously 2600the C<ddd> isn't going to match the target string. But look at this 2601example: 2602 2603 $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match, 2604 # but _does_ print 2605 2606Hmm. What happened here? If you've been following along, you know that 2607the above pattern should be effectively (almost) the same as the last one; 2608enclosing the C<'d'> in a character class isn't going to change what it 2609matches. So why does the first not print while the second one does? 2610 2611The answer lies in the optimizations the regexp engine makes. In the first 2612case, all the engine sees are plain old characters (aside from the 2613C<?{}> construct). It's smart enough to realize that the string C<'ddd'> 2614doesn't occur in our target string before actually running the pattern 2615through. But in the second case, we've tricked it into thinking that our 2616pattern is more complicated. It takes a look, sees our 2617character class, and decides that it will have to actually run the 2618pattern to determine whether or not it matches, and in the process of 2619running it hits the print statement before it discovers that we don't 2620have a match. 2621 2622To take a closer look at how the engine does optimizations, see the 2623section L</"Pragmas and debugging"> below. 2624 2625More fun with C<?{}>: 2626 2627 $x =~ /(?{print "Hi Mom!";})/; # matches, 2628 # prints 'Hi Mom!' 2629 $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches, 2630 # prints '1' 2631 $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches, 2632 # prints '1' 2633 2634The bit of magic mentioned in the section title occurs when the regexp 2635backtracks in the process of searching for a match. If the regexp 2636backtracks over a code expression and if the variables used within are 2637localized using C<local>, the changes in the variables produced by the 2638code expression are undone! Thus, if we wanted to count how many times 2639a character got matched inside a group, we could use, I<e.g.>, 2640 2641 $x = "aaaa"; 2642 $count = 0; # initialize 'a' count 2643 $c = "bob"; # test if $c gets clobbered 2644 $x =~ /(?{local $c = 0;}) # initialize count 2645 ( a # match 'a' 2646 (?{local $c = $c + 1;}) # increment count 2647 )* # do this any number of times, 2648 aa # but match 'aa' at the end 2649 (?{$count = $c;}) # copy local $c var into $count 2650 /x; 2651 print "'a' count is $count, \$c variable is '$c'\n"; 2652 2653This prints 2654 2655 'a' count is 2, $c variable is 'bob' 2656 2657If we replace the S<C< (?{local $c = $c + 1;})>> with 2658S<C< (?{$c = $c + 1;})>>, the variable changes are I<not> undone 2659during backtracking, and we get 2660 2661 'a' count is 4, $c variable is 'bob' 2662 2663Note that only localized variable changes are undone. Other side 2664effects of code expression execution are permanent. Thus 2665 2666 $x = "aaaa"; 2667 $x =~ /(a(?{print "Yow\n";}))*aa/; 2668 2669produces 2670 2671 Yow 2672 Yow 2673 Yow 2674 Yow 2675 2676The result C<$^R> is automatically localized, so that it will behave 2677properly in the presence of backtracking. 2678 2679This example uses a code expression in a conditional to match a 2680definite article, either C<'the'> in English or C<'der|die|das'> in 2681German: 2682 2683 $lang = 'DE'; # use German 2684 ... 2685 $text = "das"; 2686 print "matched\n" 2687 if $text =~ /(?(?{ 2688 $lang eq 'EN'; # is the language English? 2689 }) 2690 the | # if so, then match 'the' 2691 (der|die|das) # else, match 'der|die|das' 2692 ) 2693 /xi; 2694 2695Note that the syntax here is C<(?(?{...})I<yes-regexp>|I<no-regexp>)>, not 2696C<(?((?{...}))I<yes-regexp>|I<no-regexp>)>. In other words, in the case of a 2697code expression, we don't need the extra parentheses around the 2698conditional. 2699 2700If you try to use code expressions where the code text is contained within 2701an interpolated variable, rather than appearing literally in the pattern, 2702Perl may surprise you: 2703 2704 $bar = 5; 2705 $pat = '(?{ 1 })'; 2706 /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated 2707 /foo(?{ 1 })$bar/; # compiles ok, $bar interpolated 2708 /foo${pat}bar/; # compile error! 2709 2710 $pat = qr/(?{ $foo = 1 })/; # precompile code regexp 2711 /foo${pat}bar/; # compiles ok 2712 2713If a regexp has a variable that interpolates a code expression, Perl 2714treats the regexp as an error. If the code expression is precompiled into 2715a variable, however, interpolating is ok. The question is, why is this an 2716error? 2717 2718The reason is that variable interpolation and code expressions 2719together pose a security risk. The combination is dangerous because 2720many programmers who write search engines often take user input and 2721plug it directly into a regexp: 2722 2723 $regexp = <>; # read user-supplied regexp 2724 $chomp $regexp; # get rid of possible newline 2725 $text =~ /$regexp/; # search $text for the $regexp 2726 2727If the C<$regexp> variable contains a code expression, the user could 2728then execute arbitrary Perl code. For instance, some joker could 2729search for S<C<system('rm -rf *');>> to erase your files. In this 2730sense, the combination of interpolation and code expressions I<taints> 2731your regexp. So by default, using both interpolation and code 2732expressions in the same regexp is not allowed. If you're not 2733concerned about malicious users, it is possible to bypass this 2734security check by invoking S<C<use re 'eval'>>: 2735 2736 use re 'eval'; # throw caution out the door 2737 $bar = 5; 2738 $pat = '(?{ 1 })'; 2739 /foo${pat}bar/; # compiles ok 2740 2741Another form of code expression is the I<pattern code expression>. 2742The pattern code expression is like a regular code expression, except 2743that the result of the code evaluation is treated as a regular 2744expression and matched immediately. A simple example is 2745 2746 $length = 5; 2747 $char = 'a'; 2748 $x = 'aaaaabb'; 2749 $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a' 2750 2751 2752This final example contains both ordinary and pattern code 2753expressions. It detects whether a binary string C<1101010010001...> has a 2754Fibonacci spacing 0,1,1,2,3,5,... of the C<'1'>'s: 2755 2756 $x = "1101010010001000001"; 2757 $z0 = ''; $z1 = '0'; # initial conditions 2758 print "It is a Fibonacci sequence\n" 2759 if $x =~ /^1 # match an initial '1' 2760 (?: 2761 ((??{ $z0 })) # match some '0' 2762 1 # and then a '1' 2763 (?{ $z0 = $z1; $z1 .= $^N; }) 2764 )+ # repeat as needed 2765 $ # that is all there is 2766 /x; 2767 printf "Largest sequence matched was %d\n", length($z1)-length($z0); 2768 2769Remember that C<$^N> is set to whatever was matched by the last 2770completed capture group. This prints 2771 2772 It is a Fibonacci sequence 2773 Largest sequence matched was 5 2774 2775Ha! Try that with your garden variety regexp package... 2776 2777Note that the variables C<$z0> and C<$z1> are not substituted when the 2778regexp is compiled, as happens for ordinary variables outside a code 2779expression. Rather, the whole code block is parsed as perl code at the 2780same time as perl is compiling the code containing the literal regexp 2781pattern. 2782 2783This regexp without the C</x> modifier is 2784 2785 /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/ 2786 2787which shows that spaces are still possible in the code parts. Nevertheless, 2788when working with code and conditional expressions, the extended form of 2789regexps is almost necessary in creating and debugging regexps. 2790 2791 2792=head2 Backtracking control verbs 2793 2794Perl 5.10 introduced a number of control verbs intended to provide 2795detailed control over the backtracking process, by directly influencing 2796the regexp engine and by providing monitoring techniques. See 2797L<perlre/"Special Backtracking Control Verbs"> for a detailed 2798description. 2799 2800Below is just one example, illustrating the control verb C<(*FAIL)>, 2801which may be abbreviated as C<(*F)>. If this is inserted in a regexp 2802it will cause it to fail, just as it would at some 2803mismatch between the pattern and the string. Processing 2804of the regexp continues as it would after any "normal" 2805failure, so that, for instance, the next position in the string or another 2806alternative will be tried. As failing to match doesn't preserve capture 2807groups or produce results, it may be necessary to use this in 2808combination with embedded code. 2809 2810 %count = (); 2811 "supercalifragilisticexpialidocious" =~ 2812 /([aeiou])(?{ $count{$1}++; })(*FAIL)/i; 2813 printf "%3d '%s'\n", $count{$_}, $_ for (sort keys %count); 2814 2815The pattern begins with a class matching a subset of letters. Whenever 2816this matches, a statement like C<$count{'a'}++;> is executed, incrementing 2817the letter's counter. Then C<(*FAIL)> does what it says, and 2818the regexp engine proceeds according to the book: as long as the end of 2819the string hasn't been reached, the position is advanced before looking 2820for another vowel. Thus, match or no match makes no difference, and the 2821regexp engine proceeds until the entire string has been inspected. 2822(It's remarkable that an alternative solution using something like 2823 2824 $count{lc($_)}++ for split('', "supercalifragilisticexpialidocious"); 2825 printf "%3d '%s'\n", $count2{$_}, $_ for ( qw{ a e i o u } ); 2826 2827is considerably slower.) 2828 2829 2830=head2 Pragmas and debugging 2831 2832Speaking of debugging, there are several pragmas available to control 2833and debug regexps in Perl. We have already encountered one pragma in 2834the previous section, S<C<use re 'eval';>>, that allows variable 2835interpolation and code expressions to coexist in a regexp. The other 2836pragmas are 2837 2838 use re 'taint'; 2839 $tainted = <>; 2840 @parts = ($tainted =~ /(\w+)\s+(\w+)/; # @parts is now tainted 2841 2842The C<taint> pragma causes any substrings from a match with a tainted 2843variable to be tainted as well. This is not normally the case, as 2844regexps are often used to extract the safe bits from a tainted 2845variable. Use C<taint> when you are not extracting safe bits, but are 2846performing some other processing. Both C<taint> and C<eval> pragmas 2847are lexically scoped, which means they are in effect only until 2848the end of the block enclosing the pragmas. 2849 2850 use re '/m'; # or any other flags 2851 $multiline_string =~ /^foo/; # /m is implied 2852 2853The C<re '/flags'> pragma (introduced in Perl 28545.14) turns on the given regular expression flags 2855until the end of the lexical scope. See 2856L<re/"'E<sol>flags' mode"> for more 2857detail. 2858 2859 use re 'debug'; 2860 /^(.*)$/s; # output debugging info 2861 2862 use re 'debugcolor'; 2863 /^(.*)$/s; # output debugging info in living color 2864 2865The global C<debug> and C<debugcolor> pragmas allow one to get 2866detailed debugging info about regexp compilation and 2867execution. C<debugcolor> is the same as debug, except the debugging 2868information is displayed in color on terminals that can display 2869termcap color sequences. Here is example output: 2870 2871 % perl -e 'use re "debug"; "abc" =~ /a*b+c/;' 2872 Compiling REx 'a*b+c' 2873 size 9 first at 1 2874 1: STAR(4) 2875 2: EXACT <a>(0) 2876 4: PLUS(7) 2877 5: EXACT <b>(0) 2878 7: EXACT <c>(9) 2879 9: END(0) 2880 floating 'bc' at 0..2147483647 (checking floating) minlen 2 2881 Guessing start of match, REx 'a*b+c' against 'abc'... 2882 Found floating substr 'bc' at offset 1... 2883 Guessed: match at offset 0 2884 Matching REx 'a*b+c' against 'abc' 2885 Setting an EVAL scope, savestack=3 2886 0 <> <abc> | 1: STAR 2887 EXACT <a> can match 1 times out of 32767... 2888 Setting an EVAL scope, savestack=3 2889 1 <a> <bc> | 4: PLUS 2890 EXACT <b> can match 1 times out of 32767... 2891 Setting an EVAL scope, savestack=3 2892 2 <ab> <c> | 7: EXACT <c> 2893 3 <abc> <> | 9: END 2894 Match successful! 2895 Freeing REx: 'a*b+c' 2896 2897If you have gotten this far into the tutorial, you can probably guess 2898what the different parts of the debugging output tell you. The first 2899part 2900 2901 Compiling REx 'a*b+c' 2902 size 9 first at 1 2903 1: STAR(4) 2904 2: EXACT <a>(0) 2905 4: PLUS(7) 2906 5: EXACT <b>(0) 2907 7: EXACT <c>(9) 2908 9: END(0) 2909 2910describes the compilation stage. C<STAR(4)> means that there is a 2911starred object, in this case C<'a'>, and if it matches, goto line 4, 2912I<i.e.>, C<PLUS(7)>. The middle lines describe some heuristics and 2913optimizations performed before a match: 2914 2915 floating 'bc' at 0..2147483647 (checking floating) minlen 2 2916 Guessing start of match, REx 'a*b+c' against 'abc'... 2917 Found floating substr 'bc' at offset 1... 2918 Guessed: match at offset 0 2919 2920Then the match is executed and the remaining lines describe the 2921process: 2922 2923 Matching REx 'a*b+c' against 'abc' 2924 Setting an EVAL scope, savestack=3 2925 0 <> <abc> | 1: STAR 2926 EXACT <a> can match 1 times out of 32767... 2927 Setting an EVAL scope, savestack=3 2928 1 <a> <bc> | 4: PLUS 2929 EXACT <b> can match 1 times out of 32767... 2930 Setting an EVAL scope, savestack=3 2931 2 <ab> <c> | 7: EXACT <c> 2932 3 <abc> <> | 9: END 2933 Match successful! 2934 Freeing REx: 'a*b+c' 2935 2936Each step is of the form S<C<< n <x> <y> >>>, with C<< <x> >> the 2937part of the string matched and C<< <y> >> the part not yet 2938matched. The S<C<< | 1: STAR >>> says that Perl is at line number 1 2939in the compilation list above. See 2940L<perldebguts/"Debugging Regular Expressions"> for much more detail. 2941 2942An alternative method of debugging regexps is to embed C<print> 2943statements within the regexp. This provides a blow-by-blow account of 2944the backtracking in an alternation: 2945 2946 "that this" =~ m@(?{print "Start at position ", pos, "\n";}) 2947 t(?{print "t1\n";}) 2948 h(?{print "h1\n";}) 2949 i(?{print "i1\n";}) 2950 s(?{print "s1\n";}) 2951 | 2952 t(?{print "t2\n";}) 2953 h(?{print "h2\n";}) 2954 a(?{print "a2\n";}) 2955 t(?{print "t2\n";}) 2956 (?{print "Done at position ", pos, "\n";}) 2957 @x; 2958 2959prints 2960 2961 Start at position 0 2962 t1 2963 h1 2964 t2 2965 h2 2966 a2 2967 t2 2968 Done at position 4 2969 2970=head1 SEE ALSO 2971 2972This is just a tutorial. For the full story on Perl regular 2973expressions, see the L<perlre> regular expressions reference page. 2974 2975For more information on the matching C<m//> and substitution C<s///> 2976operators, see L<perlop/"Regexp Quote-Like Operators">. For 2977information on the C<split> operation, see L<perlfunc/split>. 2978 2979For an excellent all-around resource on the care and feeding of 2980regular expressions, see the book I<Mastering Regular Expressions> by 2981Jeffrey Friedl (published by O'Reilly, ISBN 1556592-257-3). 2982 2983=head1 AUTHOR AND COPYRIGHT 2984 2985Copyright (c) 2000 Mark Kvale. 2986All rights reserved. 2987Now maintained by Perl porters. 2988 2989This document may be distributed under the same terms as Perl itself. 2990 2991=head2 Acknowledgments 2992 2993The inspiration for the stop codon DNA example came from the ZIP 2994code example in chapter 7 of I<Mastering Regular Expressions>. 2995 2996The author would like to thank Jeff Pinyan, Andrew Johnson, Peter 2997Haworth, Ronald J Kimball, and Joe Smith for all their helpful 2998comments. 2999 3000=cut 3001 3002