=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand.
Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme.
The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing.
When, e.g., C<""> is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in a regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

With respect to character matching, there are a few more points you
need to know about.
First of all, not all characters can be used 'as
is' in a match. Some characters, called B<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;   # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>.
Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell. If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)        # matches
    "1000\n2000" =~ /0\n20/        # matches
    "1000\t2000" =~ /\000\t2/      # doesn't match, "0" ne "\000"
    "cat" =~ /\143\x61\x74/        # matches, but a weird way to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >> > loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the B<anchor> metacharacters C<^> and C<$>.
The anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    ""       =~ /^$/;          # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't.
Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp.
Character classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                         # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class.
The special characters for a character class are C<-]\^$>. C<]>
is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.
Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.

An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;     # matches cat in 'Housecat'
    $x =~ /\bcat/;   # matches cat in 'catenates'
    $x =~ /cat\b/;   # matches cat in 'Housecat'
    $x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character?
The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    ""   =~ /./;      # doesn't match; it needs a char
    ""   =~ /^.$/;    # doesn't match; it needs a char
    "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;    # matches
    "a\n"  =~ /^.$/;  # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match.
Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful.
If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, perl will try to match the
regexp at the earliest possible point in the string. At each
character position, perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, perl will then try the
next alternative, C<cat>.
If C<cat> doesn't match either, then the
match fails and perl moves to the next position in the string.  Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches.  If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters.  At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying.  The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp.  For instance, suppose we want to search for housecats or
housekeepers.
The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice.  It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The B<grouping> metacharacters C<()> solve this problem.  Grouping
allows parts of a regexp to be treated as a single unit.  Parts of a
regexp are grouped by enclosing them in parentheses.  Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>.  The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken.
So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>.  So perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

The process of trying one alternative, seeing if it matches, and
moving on to the next alternative if it doesn't, is called
B<backtracking>.  The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods.  Successfully matching
a regexp is like arriving at a destination.  There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right.  From each trailhead there may be many paths,
some of which get you there, and some of which are dead ends.  When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail.  If you hit your
destination, you stop immediately and forget about trying all the
other trails.  You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure.
To be concrete, here is a step-by-step analysis
of what perl does when it tries to match the regexp

    "abcde" =~ /(abd|abc)(df|d|de)/;

=over 4

=item 0

Start with the first letter in the string 'a'.

=item 1

Try the first alternative in the first group 'abd'.

=item 2

Match 'a' followed by 'b'. So far so good.

=item 3

'd' in the regexp doesn't match 'c' in the string - a dead
end.  So backtrack two characters and pick the second alternative in
the first group 'abc'.

=item 4

Match 'a' followed by 'b' followed by 'c'.  We are on a roll
and have satisfied the first group. Set $1 to 'abc'.

=item 5

Move on to the second group and pick the first alternative
'df'.

=item 6

Match the 'd'.

=item 7

'f' in the regexp doesn't match 'e' in the string, so a dead
end.  Backtrack one character and pick the second alternative in the
second group 'd'.

=item 8

'd' matches.
The second grouping is satisfied, so set $2 to
'd'.

=item 9

We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".

=back

There are a couple of things to note about this analysis.  First, the
third alternative in the second group 'de' also allows a match, but we
stopped before we got to it - at a given character position, leftmost
wins.  Second, we were able to get a match at the first character
position of the string 'a'.  If there were no matches at the first
position, perl would move to the second character position 'b' and
attempt the match all over again.  Only when all possible paths at all
possible character positions have been exhausted does perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.

Even with all this work, regexp matching happens remarkably fast.  To
speed things up, during the compilation stage, perl compiles the regexp
into a compact sequence of opcodes that can often fit inside a
processor cache.  When the code is executed, these opcodes can then run
at full throttle and search very quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched.
This is very useful to find out what matched and for
text processing in general.  For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, etc.  They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
value.  In list context, however, it returns the list of matched values
C<($1,$2,$3)>.  So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
convenience, perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>, ... most-recently assigned; i.e.
the C<$1>,
C<$2>, ... associated with the rightmost closing parenthesis used in the
match).

Closely associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ... .  Backreferences are simply
matching variables that can be used I<inside> a regexp.  This is a
really nice feature - what matches later in a regexp can depend on
what matched earlier in the regexp.  Suppose we wanted to look
for doubled words in text, like 'the the'.  The following regexp finds
all 3-letter doubles with a space in between:

    /(\w\w\w)\s\1/;

The grouping assigns a value to \1, so that the same 3 letter sequence
is used for both parts.  Here are some words with repeated parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc.  and uses C<\1> to look for
a repeat.  Although C<$1> and C<\1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
so may lead to surprising and/or undefined results.
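
The backreference examples above can be tried without a dictionary
file.  The following is only a minimal sketch: the word list and the
sentence are made up for illustration, standing in for
C</usr/dict/words>, and the variable names are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical word list standing in for /usr/dict/words.
my @words = qw(beriberi banana coco perl mama murmur regexp);

# Keep the words that consist of one 1-4 letter chunk repeated twice.
# \1 must match exactly the text captured by the first group.
my @doubled = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\1$/ } @words;
print "@doubled\n";                 # prints "beriberi coco mama murmur"

# Find a doubled 3-letter word separated by one whitespace character.
my $text   = "Paris in the the spring";
my $repeat = $text =~ /(\w\w\w)\s\1/ ? $1 : '';
print "doubled word: $repeat\n";    # prints "doubled word: the"
```

Note that C<\1> appears inside the regexp while C<$1> is read only
after the match has succeeded, following the rule stated above.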

In addition to what was matched, Perl 5.6.0 also provides the
positions of what was matched with the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)

Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string.  If you use them, perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match.
An example:

    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'

In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'.  It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, and C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program.  So if raw
performance is a goal of your application, they should be avoided.
If you need them, use C<@-> and C<@+> instead:

    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )

=head2 Matching repetitions

The examples in the previous section display an annoying weakness.  We
were only matching 3-letter words, or syllables of 4 letters or
less.  We'd like to be able to match words or syllables of any length,
without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the B<quantifier> metacharacters C<?>,
C<*>, C<+>, and C<{}> were created for.
They allow us to determine the
number of repeats of a portion of a regexp we consider to be a
match.  Quantifiers are put immediately after the character, character
class, or grouping that we want to specify.  They have the following
meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
                               # this produces $1 and the other does not.

    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa

For all of these quantifiers, perl will try to match as much of the
string as possible, while still allowing the regexp to succeed.  Thus
with C</a?.../>, perl will first try to match the regexp with the C<a>
present; if that fails, perl will try to match the regexp without the
C<a> present.
For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

Which is what we might expect: the match finds the only C<cat> in the
string and locks onto it.  Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

One might initially guess that perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>.  Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match.  In
this example, that means matching the C<at> sequence against the final
C<at> in the string.  The other important principle illustrated here is that
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps.  Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string.   Quantifiers that
grab as much of the string as possible are called B<maximal match> or
B<greedy> quantifiers.
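
Greedy matching is often exactly what is wanted.  As a small sketch,
assuming a made-up file path and illustrative variable names, the
leftmost C<.*> can be used to split a path at its I<last> slash:

```perl
use strict;
use warnings;

# Hypothetical path, chosen only for illustration.
my $path = "/usr/local/lib/perl5/strict.pm";

# The leftmost .* is greedy: it first runs to the end of the string,
# then backtracks only as far as the last '/' so that the rest of the
# regexp can still match.
my ($dir, $file) = $path =~ m{^(.*)/(.*)$};

print "dir:  $dir\n";     # prints "dir:  /usr/local/lib/perl5"
print "file: $file\n";    # prints "file: strict.pm"
```

Because the first quantifier takes as much as it can, the second
C<.*> is left with only the text after the final slash.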

When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match.  The next
leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match.  And so on, until all the regexp elements are
satisfied.

=back

As we have seen above, Principle 0 overrides the others - the regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.

Here is an example of these principles in action:

    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
                              # $1 = 'The programming republic of Pe'
                              # $2 = 'r'
                              # $3 = 'l'

This regexp matches at the earliest string position, C<'T'>.  One
might think that C<e>, being leftmost in the alternation, would be
matched, but C<r> produces the longest string in the first quantifier.

    $x =~ /(m{1,2})(.*)$/;  # matches,
                            # $1 = 'mm'
                            # $2 = 'ing republic of Perl'

Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.

    $x =~ /.*(m{1,2})(.*)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ing republic of Perl'

Here, the regexp matches at the start of the string. The first
quantifier C<.*> grabs as much as possible, leaving just a single
C<'m'> for the second quantifier C<m{1,2}>.

    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'

Here, C<.?> eats its maximal one character at the earliest possible
position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
the opportunity to match both C<m>'s. Finally,

    "aXXXb" =~ /(X*)/; # matches with $1 = ''

because it can match zero copies of C<'X'> at the beginning of the
string.  If you definitely want to match at least one C<'X'>, use
C<X+>, not C<X*>.

Sometimes greed is not good.  At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece.  For
this purpose, Larry Wall created the S<B<minimal match> > or
B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>.  These are
the usual quantifiers with a C<?> appended to them.  They have the
following meanings:

=over 4

=item *

C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.

=item *

C<a*?> = match 'a' 0 or more times, i.e., any number of times,
but as few times as possible

=item *

C<a+?> = match 'a' 1 or more times, i.e., at least once, but
as few times as possible

=item *

C<a{n,m}?> = match at least C<n> times, not more than C<m>
times, as few times as possible

=item *

C<a{n,}?> = match at least C<n> times, but as few times as
possible

=item *

C<a{n}?> = match exactly C<n> times.  Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.

=back

Let's look at the example above, but with minimal quantifiers:

    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string C<^>
and the alternation to match is C<Th>, with the alternation C<e|r>
matching C<e>.  The second quantifier C<.*> is free to gobble up the
rest of the string.
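
The difference between the greedy and minimal forms is easiest to see
side by side.  Here is a minimal sketch, assuming a made-up input line
with two quoted phrases; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical input with two quoted phrases.
my $line = 'say "hello" and then "goodbye" to everyone';

# Greedy: .* runs from the first quote to the last quote in the string.
my ($greedy) = $line =~ /"(.*)"/;

# Minimal: .*? stops at the first closing quote that lets the regexp match.
my ($minimal) = $line =~ /"(.*?)"/;

print "greedy:  $greedy\n";     # prints: greedy:  hello" and then "goodbye
print "minimal: $minimal\n";    # prints: minimal: hello
```

Both regexps start matching at the same opening quote; only the amount
of text taken by the quantifier differs.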

    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first
C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
matches just one C<'m'>.  Although the second quantifier C<.*?> would
prefer to match no characters, it is constrained by the end-of-string
anchor C<$> to match the rest of the string.

    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
                                  # $1 = 'The progra'
                                  # $2 = 'm'
                                  # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier C<.*?>
to match the empty string, because it is not constrained by a C<^>
anchor to match the beginning of the string.  Principle 0 applies here,
however.  Because it is possible for the whole regexp to match at the
start of the string, it I<will> match at the start of the string.  Thus
the first quantifier has to match everything up to the first C<m>.  The
second minimal quantifier matches just one C<m> and the third
quantifier matches the rest of the string.

    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
                                 # $1 = 'a'
                                 # $2 = 'mm'
                                 # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier C<.??> can match
earliest at position C<'a'>, so it does.  The second quantifier is
greedy, so it matches C<mm>, and the third matches the rest of the
string.

We can modify principle 3 above to take into account non-greedy
quantifiers:

=over 4

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
regexp to match.  The next leftmost greedy (non-greedy) quantifier, if
any, will try to match as much (little) of the string remaining
available to it as possible, while still allowing the whole regexp to
match.  And so on, until all the regexp elements are satisfied.

=back

Just like alternation, quantifiers are also susceptible to
backtracking.
Here is a step-by-step analysis of the example

    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

=over 4

=item 0

Start with the first letter in the string 't'.

=item 1

The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.

=item 2

'a' in the regexp element 'at' doesn't match the end of the
string.  Backtrack one character.

=item 3

'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.

=item 4

Now we can match the 'a' and the 't'.

=item 5

Move on to the third element '.*'.  Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.

=item 6

We are done!

=back

Most of the time, all this moving forward and backtracking happens
quickly and searching is fast.
There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string. A typical structure that blows up in your face is of the form

    /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> of length k and the second of length n-k, m repetitions
whose bits add up to length n, etc. In fact, the number of ways to
partition a string grows exponentially with its length. A
regexp may get lucky and match early in the process, but if there is
no match, perl will try I<every> possibility before giving up. So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.

=head2 Building a regexp

At this point, we have all the basic regexp concepts covered, so let's
give a more involved example of a regular expression. We will build a
regexp that matches numbers.

The first task in building a regexp is to decide what we want to match
and what we want to exclude. In our case, we want to match both
integers and floating point numbers and we want to reject any string
that isn't a number.
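
One way to make this decision concrete is to write the wanted and
unwanted strings down as data and check any candidate regexp against
them. The following is only a sketch: the sub name C<wrong_answers> and
the particular sample strings are assumptions made for illustration,
and the candidate shown is a deliberately incomplete integer-only
regexp, not the regexp this section goes on to build.

```perl
use strict;
use warnings;

# Strings a number regexp should accept, and strings it should reject.
# Both lists are illustrative samples, not an exhaustive specification.
my @accept = ('123', '-7', '123.', '0.345', '.34', '-1e6', '25.4E-72');
my @reject = ('.', 'abc', '1.2.3', 'e5', '');

# Return two array refs: strings that should have matched but did not,
# and strings that should have been rejected but matched.
sub wrong_answers {
    my ($re, $accept, $reject) = @_;
    my @missed = grep { $_ !~ $re } @$accept;
    my @leaked = grep { $_ =~ $re } @$reject;
    return (\@missed, \@leaked);
}

# A first candidate that only knows about signed integers.
my ($missed, $leaked) = wrong_answers(qr/^[+-]?\d+$/, \@accept, \@reject);
print "missed: @$missed\n";
print "leaked: @$leaked\n";
```

With the integer-only candidate, every floating point sample shows up
in the "missed" list, which is exactly the gap the rest of this section
sets out to close.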

The next task is to break the problem down into smaller problems that
are easily converted into a regexp.

The simplest case is integers. These consist of a sequence of digits,
with an optional sign in front. The digits we can represent with
C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
regexp is

    /[+-]?\d+/;  # matches integers

A floating point number potentially has a sign, an integral part, a
decimal point, a fractional part, and an exponent. One or more of these
parts is optional, so we need to check out the different
possibilities. Floating point numbers which are in proper form include
123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
front is completely optional and can be matched by C<[+-]?>. We can
see that if there is no exponent, floating point numbers must have a
decimal point, otherwise they are integers. We might be tempted to
model these with C<\d*\.\d*>, but this would also match just a single
decimal point, which is not a number. So the three cases of floating
point number sans exponent are

    /[+-]?\d+\./;     # 1., 321., etc.
    /[+-]?\.\d+/;     # .1, .234, etc.
    /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.

These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent. Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be 'decoupled' from the
mantissa. The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<e> or C<E>, followed by an integer. So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher. In complex situations like this, the C<//x> modifier for a
match is invaluable.
It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning. Using it,
we can rewrite our 'extended' regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp? The answer is to backslash it
S<C<'\ '> > or put it in a character class S<C<[ ]> >. The same thing
goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
a space between the sign and the mantissa/integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

or written in the compact form,

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp. To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program. This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that
specifies patterns.

=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches. Associated with the matching operator, we have discussed the
single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
extended C<//x> modifiers.

There are a few more things you might want to know about matching
operators. First, we pointed out earlier that variables in regexps are
substituted before the regexp is evaluated:

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

This will print any lines containing the word C<Seuss>. It is not as
efficient as it could be, however, because perl has to re-evaluate
C<$pattern> each time through the loop.
If C<$pattern> won't be
changing over the lifetime of the script, we can add the C<//o>
modifier, which directs perl to only perform variable substitutions
once:

    #!/usr/bin/perl
    #    Improved simple_grep
    $regexp = shift;
    while (<>) {
        print if /$regexp/o;  # a good deal faster
    }

If you change the variable after the first substitution happens, perl
will ignore the change. If you don't want any substitutions at all, use
the special delimiter C<m''>:

    @pattern = ('Seuss');
    while (<>) {
        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
    }

C<m''> acts like single quotes on a regexp; all other C<m> delimiters
act like double quotes. If the regexp evaluates to the empty string,
the regexp in the I<last successful match> is used instead. So we have

    "dog" =~ /d/;      # 'd' matches
    "dogbert" =~ //;   # this matches the 'd' regexp used before

The final two modifiers C<//g> and C<//c> concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the
C<pos()> function.

The use of C<//g> is shown in the following example. Suppose we have
a string that consists of words separated by spaces. If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C<//g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>.
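
The difference between a failed C<//g> match and a failed C<//gc>
match can be observed directly with C<pos()>. The sketch below is an
illustration only: the sub name C<pos_after_failure> and the sample
string are assumptions, chosen so that the second match is guaranteed
to fail.

```perl
use strict;
use warnings;

# Match one word with //g, then attempt a match that cannot succeed,
# either with //g or with //gc.  Returns the position before and after
# the failed attempt.
sub pos_after_failure {
    my ($use_c) = @_;
    my $x = "cat dog house";
    $x =~ /\w+/g or die "expected a word";   # matches 'cat'
    my $before = pos($x);
    # The string contains no digits, so this match always fails.
    my $matched = $use_c ? $x =~ /\d+/gc : $x =~ /\d+/g;
    return ($before, pos($x));
}

for my $use_c (0, 1) {
    my ($before, $after) = pos_after_failure($use_c);
    printf "%-4s before=%d after=%s\n",
        $use_c ? '//gc' : '//g',
        $before,
        defined $after ? $after : 'undef';
}
```

Without C<//c> the position comes back undefined after the failure;
with C<//c> it is still where the last successful match left it.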
The current position in the string is
associated with the string, not the regexp. This means that different
strings have different positions and their respective positions can be
set or read independently.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp. So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

Closely associated with the C<//g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C<//g> match left
off. C<\G> allows us to easily do context-sensitive matching:

    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing

The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
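
As a rough sketch of this piecewise style, the loop below walks a
measurement string with C<\G> and C<//gc>, trying one small regexp
after another at the current position. The sub name C<tokenize> and
the C<NUM:>, C<UNIT:>, C<ITEM:> and C<UNKNOWN:> labels are invented
for this illustration; only the units and item names come from the
example above.

```perl
use strict;
use warnings;

# Split a measurement such as "12 kg. widget" into labelled pieces.
# Each branch is anchored with \G so it can only match at the current
# position, and //c keeps that position when a branch fails.
sub tokenize {
    my ($str) = @_;
    my @tokens;
    pos($str) = 0;
    while (pos($str) < length $str) {
        if    ($str =~ /\G\s+/gc)               { next }
        elsif ($str =~ /\G([+-]?\d+)/gc)        { push @tokens, "NUM:$1" }
        elsif ($str =~ /\G(kg|lbs)\./gc)        { push @tokens, "UNIT:$1" }
        elsif ($str =~ /\G(widget|sprocket)/gc) { push @tokens, "ITEM:$1" }
        else {
            # Nothing matched here: report the remainder and stop.
            push @tokens, "UNKNOWN:" . substr($str, pos($str));
            last;
        }
    }
    return @tokens;
}

print join(' ', tokenize("12 kg. widget")), "\n";
```

Because every branch starts with C<\G>, no branch can skip ahead over
text that another branch should have handled, which is what lets
ordinary Perl control flow decide the next step.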
Currently, the C<\G> anchor is only fully supported when used to anchor
to the start of the pattern.

C<\G> is also invaluable in processing fixed length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>. In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA> > gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match.
Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, which is not
what we want. The solution is to use C<\G> to anchor the match to the
codon alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

B<Search and replace>

Regular expressions also play a big role in B<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.
Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped. With the C<s///> operator, the
matched variables C<$1>, C<$2>, etc. are immediately available for use
in the replacement expression, so we use C<$1> to replace the quoted
string with just what was quoted. With the global modifier, C<s///g>
will search and replace all occurrences of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # doesn't do it all:
                       # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # does it all:
                       # $x contains "I batted four for four"

If you prefer 'regex' over 'regexp' in this tutorial, you could use
the following program to replace it:

    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/go;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod

In C<simple_replace> we used the C<s///g> modifier to replace all
occurrences of the regexp on each line and the C<s///o> modifier to
compile the regexp only once. As with C<simple_grep>, both the
C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around
the replacement string and the evaluated result is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text. This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used, as in C<s'''>, then the regexp and replacement are treated as
single quoted strings and there are no substitutions. C<s///> in list
context returns the same thing as in scalar context, i.e., the number
of matches.

B<The split operator>

The B<C<split> > function can also optionally use a matching operator
C<m//> to split a string. C<split /regexp/, string, limit> splits
C<string> into a list of substrings and returns that list. The regexp
is used to match the character sequence that the C<string> is split
with respect to. The C<limit>, if present, constrains splitting into
no more than C<limit> number of strings. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the list produced contains the matched substrings from
the groupings as well.
For instance,

    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x;     # $dirs[0] = ''
                                # $dirs[1] = 'usr'
                                # $dirs[2] = 'bin'
                                # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'
                                # $parts[5] = '/'
                                # $parts[6] = 'perl'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

If you have read this far, congratulations! You now have all the basic
tools needed to use regular expressions to solve a wide range of text
processing problems. If this is your first time through the tutorial,
why not stop here and play around with regexps a while... S<Part 2>
concerns the more esoteric aspects of regular expressions and those
concepts certainly aren't needed right at the start.

=head1 Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If
matching regular expressions is analogous to a walk in the woods, then
the tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time. Most of the tools in part 2
are analogous to flare guns and satellite phones.
They aren't used
too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric
capabilities of perl regexps. In Part 2, we will assume you are
comfortable with the basics and concentrate on the new features.

=head2 More on characters, strings, and character classes

There are a number of escape sequences and character classes that we
haven't covered yet.

There are several escape sequences that convert characters or strings
between upper and lower case. C<\l> and C<\u> convert the next
character to lower or upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/;  # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'

C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
C<\U> and C<\E>, to lower or upper case:

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string

If there is no C<\E>, case is converted until the end of the
string.
The regexps C<\L\u$word> or C<\u\L$word> convert the first
character of C<$word> to uppercase and the rest of the characters to
lowercase.

Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>. The escape sequence
C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For
instance,

    $x = "\QThat !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language

It does not protect C<$> or C<@>, so that variables can still be
substituted.

With the advent of 5.6.0, perl regexps can handle more than just the
standard ASCII character set. Perl now supports B<Unicode>, a standard
for encoding the character sets from many of the world's written
languages. Unicode does this by allowing characters to be more than
one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters
are still encoded as one byte, but characters greater than C<chr(127)>
may be stored as two or more bytes.

What does this mean for regexps? Well, regexp users don't need to know
much about perl's internal representation of strings. But they do need
to know 1) how to represent Unicode characters in a regexp and 2) when
a matching operation will treat the string to be searched as a
sequence of bytes (the old way) or as a sequence of Unicode characters
(the new way).
The answer to 1) is that Unicode characters greater
than C<chr(127)> may be represented using the C<\x{hex}> notation,
with C<hex> a hexadecimal integer:

    /\x{263a}/;  # match a Unicode smiley face :)

Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: C<\x{ab}>. Note that this is different than C<\xab>,
which is just a hexadecimal byte with no Unicode significance.

B<NOTE>: in Perl 5.6.0 it used to be that one needed to say
C<use utf8> to use any Unicode features. This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed. (The only case where it matters is when your Perl script itself
is in Unicode and encoded in UTF-8; then an explicit C<use utf8> is
needed.)

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the S<B<named character> > escape
sequence C<\N{name}>. C<name> is a name for the Unicode character, as
specified in the Unicode standard.
For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use

    use charnames ":full"; # use named chars with Unicode full names
    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;   # matches

One can also use short names or restrict names to a certain alphabet:

    use charnames ':full';
    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";

    use charnames ":short";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";

A list of full names is found in the file Names.txt in the
lib/perl5/5.X.X/unicore directory.

The answer to requirement 2), as of 5.6.0, is that if a regexp
contains Unicode characters, the string is searched as a sequence of
Unicode characters. Otherwise, the string is searched as a sequence of
bytes. If the string is being searched as a sequence of Unicode
characters, but matching a single byte is required, we can use the C<\C>
escape sequence. C<\C> is a character class akin to C<.> except that
it matches I<any> byte 0-255.
So

    use charnames ":full"; # use named chars with Unicode full names
    $x = "a";
    $x =~ /\C/;  # matches 'a', eats one byte
    $x = "";
    $x =~ /\C/;  # doesn't match, no bytes to match
    $x = "\N{MERCURY}";  # multi-byte Unicode character
    $x =~ /\C/;  # matches, but dangerous!

The last regexp matches, but is dangerous because the string
I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
character'. The C<\C> is best used for matching the binary data in strings
with binary data intermixed with Unicode characters.

Let us now discuss the rest of the character classes. Just as with
Unicode characters, there are named Unicode character classes
represented by the C<\p{name}> escape sequence. Closely associated is
the C<\P{name}> character class, which is the negation of the
C<\p{name}> class.
For example, to match lower and uppercase
characters,

    use charnames ":full"; # use named chars with Unicode full names
    $x = "BOB";
    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase

Here is the association between some Perl named classes and the
traditional Unicode classes:

    Perl class name  Unicode class name or regular expression

    IsAlpha          /^[LM]/
    IsAlnum          /^[LMN]/
    IsASCII          $code <= 127
    IsCntrl          /^C/
    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
    IsDigit          Nd
    IsGraph          /^([LMNPS]|Co)/
    IsLower          Ll
    IsPrint          /^([LMNPS]|Co|Zs)/
    IsPunct          /^P/
    IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
    IsUpper          /^L[ut]/
    IsWord           /^[LMN]/ || $code eq "005F"
    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/

You can also use the official Unicode class names with C<\p> and
C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
letters, or C<\P{Nd}> for non-digits.
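
A minimal sketch of the official class names in use, assuming a sample
string invented for this illustration, counts the characters that fall
into each class:

    use strict;
    use warnings;

    my $text = "Room 42b";
    my @upper     = $text =~ /\p{Lu}/g;  # uppercase letters
    my @letters   = $text =~ /\p{L}/g;   # letters of any case
    my @nondigits = $text =~ /\P{Nd}/g;  # everything but decimal digits

    print scalar(@upper), " uppercase, ",
          scalar(@letters), " letters, ",
          scalar(@nondigits), " non-digits\n";
    # prints "1 uppercase, 5 letters, 6 non-digits"

Each C<\p{name}> class is used here just like a built-in class such as
C<\d>.
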
If a C<name> is just one
letter, the braces can be dropped. For instance, C<\pM> is the
character class of Unicode 'marks', for example accent marks.
For the full list see L<perlunicode>.

Unicode has also been separated into various sets of characters
which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
For the full list see L<perlunicode>.

C<\X> is an abbreviation for a character class sequence that includes
the Unicode 'combining character sequences'. A 'combining character
sequence' is a base character followed by any number of combining
characters. An example of a combining character is an accent. Using
the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
character sequence with base character C<A> and combining character
S<C<COMBINING RING> >, which translates in Danish to A with the circle
atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>,
i.e., a non-mark followed by zero or more marks.

For the full and latest information about Unicode see the latest
Unicode standard, or the Unicode Consortium's website http://www.unicode.org/

As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
name of the POSIX class.
The POSIX classes are C<alpha>, C<alnum>,
C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
is being used, then these classes are defined the same as their
corresponding perl Unicode classes: C<[:upper:]> is the same as
C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
character classes. To negate a POSIX class, put a C<^> in front of
the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
be used just like C<\d>, with the exception that POSIX character
classes can only be used inside of a character class:

    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;      # match '=item',
                                # followed by a space and a digit
    use charnames ":full";
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit

Whew! That is all the rest of the characters and character classes.

=head2 Compiling and saving regular expressions

In Part 1 we discussed the C<//o> modifier, which compiles a regexp
just once. This suggests that a compiled regexp is some data structure
that can be stored once and used again and again. The regexp quote
C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
regexp and transforms the result into a form that can be assigned to a
variable:

    $reg = qr/foo+bar?/;  # reg contains a compiled regexp

Then C<$reg> can be used as a regexp:

    $x = "fooooba";
    $x =~ $reg;     # matches, just like /foo+bar?/
    $x =~ /$reg/;   # same thing, alternate form

C<$reg> can also be interpolated into a larger regexp:

    $x =~ /(abc)?$reg/;  # still matches

As with the matching operator, the regexp quote can use different
delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote
delimiters C<qr''> prevent any interpolation from taking place.

Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered.
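
As a rough sketch of this idea (the names C<$word> and C<$pair> and the
sample lines are assumptions made up for the illustration), a small
compiled regexp can be used as a building block for a larger one, and
the result reused across a loop:

    use strict;
    use warnings;

    my $word = qr/\w+/;                        # compiled once
    my $pair = qr/($word)\s*=\s*($word)/;      # built from the piece above

    for my $line ("color = red", "no assignment here") {
        if (my ($key, $value) = $line =~ $pair) {
            print "$key => $value\n";          # prints "color => red"
        }
    }

Only the first line contains a C<key = value> pair, so only it is
printed; the second line simply fails to match.
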
Using
pre-compiled regexps, the C<simple_grep> program can be expanded into a
program that matches multiple patterns:

    % cat > multi_grep
    #!/usr/bin/perl
    # multi_grep - match any of <number> regexps
    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    @compiled = map qr/$_/, @regexp;
    while ($line = <>) {
        foreach $pattern (@compiled) {
            if ($line =~ /$pattern/) {
                print $line;
                last;  # we matched, so move on to the next line
            }
        }
    }
    ^D

    % multi_grep 2 last for multi_grep
    $regexp[$_] = shift foreach (0..$number-1);
        foreach $pattern (@compiled) {
                last;

Storing pre-compiled regexps in an array C<@compiled> allows us to
simply loop through the regexps without any recompilation, thus gaining
flexibility without sacrificing speed.

=head2 Embedding comments and modifiers in a regular expression

Starting with this section, we will be discussing Perl's set of
B<extended patterns>. These are extensions to the traditional regular
expression syntax that provide powerful new tools for pattern
matching.
We have already seen extensions in the form of the minimal
matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The
rest of the extensions below have the form C<(?char...)>, where the
C<char> is a character that determines the type of extension.

The first extension is an embedded comment C<(?#text)>. This embeds a
comment into the regular expression without affecting its meaning. The
comment should not have any closing parentheses in the text. An
example is

    /(?# Match an integer:)[+-]?\d+/;

This style of commenting has been largely superseded by the raw,
freeform commenting that is allowed with the C<//x> modifier.

The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,

    /(?i)yes/;  # match 'yes' case insensitively
    /yes/i;     # same thing
    /(?x)(          # freeform version of an integer regexp
             [+-]?  # match an optional sign
             \d+    # match a sequence of digits
         )
    /x;

Embedded modifiers can have two important advantages over the usual
modifiers. Embedded modifiers allow a custom set of modifiers for
I<each> regexp pattern.
This is great for matching an array of regexps
that must have different modifiers:

    $pattern[0] = '(?i)doctor';
    $pattern[1] = 'Johnson';
    ...
    while (<>) {
        foreach $patt (@pattern) {
            print if /$patt/;
        }
    }

The second advantage is that embedded modifiers only affect the regexp
inside the group the embedded modifier is contained in. So grouping
can be used to localize the modifier's effects:

    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.

Embedded modifiers can also turn off any modifiers already present
by using, e.g., C<(?-i)>. Modifiers can also be combined into
a single expression, e.g., C<(?s-i)> turns on single line mode and
turns off case insensitivity.

=head2 Non-capturing groupings

We noted in Part 1 that groupings C<()> had two distinct functions: 1)
group regexp elements together as a single unit, and 2) extract, or
capture, substrings that matched the regexp in the
grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the
regexp to be treated as a single unit, but don't extract substrings or
set matching variables C<$1>, etc. Both capturing and non-capturing
groupings are allowed to co-exist in the same regexp.
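
A short sketch of the two kinds of grouping side by side, using a
made-up time string as the assumed input, might look like this:

    use strict;
    use warnings;

    my $time = "12:34:56";
    # (?:...) groups the optional seconds field without capturing it,
    # so only the hours and minutes are stored in $1 and $2.
    if ($time =~ /^(\d\d):(\d\d)(?::\d\d)?$/) {
        print "hours=$1 minutes=$2\n";  # prints "hours=12 minutes=34"
    }

Had the seconds field been wrapped in plain C<()>, it would have
occupied C<$3>.
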
Because there is
no extraction, non-capturing groupings are faster than capturing
groupings. Non-capturing groupings are also handy for choosing exactly
which parts of a regexp are to be extracted to matching variables:

    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation:

    $x = '12a34b5';
    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')

Non-capturing groupings may also have embedded modifiers:
C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
case insensitively and turns off multi-line mode.

=head2 Looking ahead and looking behind

This section concerns the lookahead and lookbehind assertions. First,
a little background.

In Perl regular expressions, most regexp elements 'eat up' a certain
amount of string when they match.
For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
sense that perl moves to the next character position in the string
after the match. There are some elements, however, that don't eat up
characters (advance the character position) if they match. The examples
we have seen so far are the anchors. The anchor C<^> matches the
beginning of the line, but doesn't eat any characters. Similarly, the
word boundary anchor C<\b> matches, e.g., if the character to the left
is a word character and the character to the right is a non-word
character, but it doesn't eat up any characters itself. Anchors are
examples of 'zero-width assertions'. Zero-width, because they consume
no characters, and assertions, because they test some property of the
string. In the context of our walk in the woods analogy to regexp
matching, most regexp elements move us along a trail, but anchors have
us stop a moment and check our surroundings. If the local environment
checks out, we can proceed forward. But if the local environment
doesn't satisfy us, we must backtrack.

Checking the environment entails either looking ahead on the trail,
looking behind, or both. C<^> looks behind, to see that there are no
characters before. C<$> looks ahead, to see that there are no
characters after. C<\b> looks both ahead and behind, to see if the
characters on either side differ in their 'word'-ness.

The lookahead and lookbehind assertions are generalizations of the
anchor concept. Lookahead and lookbehind are zero-width assertions
that let us specify which characters we want to test for. The
lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are

    $x = "I catch the housecat 'Tom-cat' with catnip";
    $x =~ /cat(?=\s+)/;   # matches 'cat' in 'housecat'
    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
                                           # $catwords[0] = 'catch'
                                           # $catwords[1] = 'catnip'
    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
                              # middle of $x

Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
width, i.e., a fixed number of characters long. Thus
C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
negated versions of the lookahead and lookbehind assertions are
denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
They evaluate true if the regexps do I<not> match:

    $x = "foobar";
    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'

The C<\C> is unsupported in lookbehind, because the already
treacherous definition of C<\C> would become even more so
when going backwards.

=head2 Using independent subexpressions to prevent backtracking

The last few extended patterns in this tutorial are experimental as of
5.6.0. Play with them, use them in some code, but don't rely on them
just yet for production code.

S<B<Independent subexpressions> > are regular expressions, in the
context of a larger regular expression, that function independently of
the larger regular expression. That is, they consume as much or as
little of the string as they wish without regard for the ability of
the larger regexp to match. Independent subexpressions are represented
by C<< (?>regexp) >>. We can illustrate their behavior by first
considering an ordinary regexp:

    $x = "ab";
    $x =~ /a*ab/;  # matches

This obviously matches, but in the process of matching, the
subexpression C<a*> first grabbed the C<a>.
Doing so, however,
wouldn't allow the whole regexp to match, so after backtracking, C<a*>
eventually gave back the C<a> and matched the empty string. Here, what
C<a*> matched was I<dependent> on what the rest of the regexp matched.

Contrast that with an independent subexpression:

    $x =~ /(?>a*)ab/;  # doesn't match!

The independent subexpression C<< (?>a*) >> doesn't care about the rest
of the regexp, so it sees an C<a> and grabs it. Then the rest of the
regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
is no backtracking and the independent subexpression does not give
up its C<a>. Thus the match of the regexp as a whole fails. A similar
behavior occurs with completely independent regexps:

    $x = "ab";
    $x =~ /a*/g;   # matches, eats an 'a'
    $x =~ /\Gab/g; # doesn't match, no 'a' available

Here C<//g> and C<\G> create a 'tag team' handoff of the string from
one regexp to the other. Regexps with an independent subexpression are
much like this, with a handoff of the string to the independent
subexpression, and a handoff of the string back to the enclosing
regexp.

The ability of an independent subexpression to prevent backtracking
can be quite useful. Suppose we want to match a non-empty string
enclosed in parentheses up to two levels deep.
Then the following
regexp matches:

    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;

The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis. The alternation is two-way, with
the first alternative C<[^()]+> matching a substring with no
parentheses and the second alternative C<\([^()]*\)> matching a
substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point. This can be done by
enclosing the inner quantifier as an independent subexpression:

    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it. Then
match failures fail much more quickly.

=head2 Conditional expressions

A S<B<conditional expression> > is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition.
There are two types of conditional expression:
C<(?(condition)yes-regexp)> and
C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true,
the C<yes-regexp> will be matched. If the C<condition> is false, the
C<yes-regexp> will be skipped and perl will move on to the next regexp
element. The second form is like an S<C<'if () {} else {}'> > statement
in Perl. If the C<condition> is true, the C<yes-regexp> will be
matched, otherwise the C<no-regexp> will be matched.

The C<condition> can have two forms. The first form is simply an
integer in parentheses C<(integer)>. It is true if the corresponding
backreference C<\integer> matched earlier in the regexp. The second
form is a bare zero-width assertion C<(?...)>, either a
lookahead, a lookbehind, or a code assertion (discussed in the next
section).

The integer form of the C<condition> allows us to choose, with more
flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or
C<"$x$y$y$x">:

    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu

The lookbehind C<condition> allows, along with backreferences,
an earlier part of the match to influence a later part of the
match. For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<C>. Note that the form is
C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
lookahead, lookbehind or code assertions, the parentheses around the
conditional are not needed.

=head2 A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions.
S<B<Code evaluation> > expressions turn that around by allowing
arbitrary Perl code to be a part of a regexp. A code evaluation
expression is denoted C<(?{code})>, with C<code> a string of Perl
statements.

Code expressions are zero-width assertions, and the value they return
depends on their environment. There are two possibilities: either the
code expression is used as a conditional in a conditional expression
C<(?(condition)...)>, or it is not. If the code expression is a
conditional, the code is evaluated and the result (i.e., the result of
the last statement) is used to determine truth or falsehood.
If the
code expression is not used as a conditional, the assertion always
evaluates true and the result is put into the special variable
C<$^R>.  The variable C<$^R> can then be used in code expressions later
in the regexp.  Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'

Pay careful attention to the next example:

    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?

At first glance, you'd think that it shouldn't print, because obviously
the C<ddd> isn't going to match the target string.  But look at this
example:

    $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
                                           # but _does_ print

Hmm.  What happened here?  If you've been following along, you know that
the above pattern should be effectively the same as the last one --
enclosing the d in a character class isn't going to change what it
matches.  So why does the first not print while the second one does?

The answer lies in the optimizations the REx engine makes.  In the first
case, all the engine sees are plain old characters (aside from the
C<?{}> construct).
It's smart enough to realize that the string 'ddd'
doesn't occur in our target string before actually running the pattern
through.  But in the second case, we've tricked it into thinking that our
pattern is more complicated than it is.  It takes a look, sees our
character class, and decides that it will have to actually run the
pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't
have a match.

To take a closer look at how the engine does optimizations, see the
section L<"Pragmas and debugging"> below.

More fun with C<?{}>:

    $x =~ /(?{print "Hi Mom!";})/;         # matches,
                                           # prints 'Hi Mom!'
    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
                                           # prints '1'
    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
                                           # prints '1'

The bit of magic mentioned in the section title occurs when the regexp
backtracks in the process of searching for a match.  If the regexp
backtracks over a code expression and if the variables used within are
localized using C<local>, the changes in the variables produced by the
code expression are undone!
Thus, if we wanted to count how many times
a character got matched inside a group, we could use, e.g.,

    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";

This prints

    'a' count is 2, $c variable is 'bob'

If we replace the S<C< (?{local $c = $c + 1;})> > with
S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
during backtracking, and we get

    'a' count is 4, $c variable is 'bob'

Note that only localized variable changes are undone.  Other side
effects of code expression execution are permanent.  Thus

    $x = "aaaa";
    $x =~ /(a(?{print "Yow\n";}))*aa/;

produces

    Yow
    Yow
    Yow
    Yow

The result C<$^R> is automatically localized, so that it will behave
properly in the presence of backtracking.
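
The two counting variants described above can be wrapped in subroutines
and compared side by side.  The following is only a sketch: the names
C<count_localized> and C<count_plain> are made up for this illustration,
and package variables declared with C<our> are assumed so that C<local>
can be applied to them under C<use strict>.

```perl
use strict;
use warnings;

# Package variables, so that 'local' can be applied to them inside the
# code expressions.  All names here are illustrative only.
our ($c, $count);

# Count the 'a's kept by the starred group, localizing every change.
sub count_localized {
    my ($str) = @_;
    $count = 0;
    $str =~ /(?{ local $c = 0; })          # initialize count
             (?: a                         # match 'a'
                 (?{ local $c = $c + 1; }) # undone on backtracking
             )*
             aa                            # force some backtracking
             (?{ $count = $c; })           # save the surviving count
            /x;
    return $count;
}

# Same pattern, but the increment is a plain assignment.
sub count_plain {
    my ($str) = @_;
    $count = 0;
    $str =~ /(?{ local $c = 0; })          # initialize count
             (?: a                         # match 'a'
                 (?{ $c = $c + 1; })       # not undone on backtracking
             )*
             aa                            # force some backtracking
             (?{ $count = $c; })           # save whatever is left
            /x;
    return $count;
}

print count_localized("aaaa"), "\n";   # 2
print count_plain("aaaa"), "\n";       # 4
```

With the localized increment, the two iterations that perl backtracks
over to make room for the final C<aa> are undone, so the first call
reports 2.  The plain assignment keeps all four increments.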

This example uses a code expression in a conditional to match the
article 'the' in either English or German:

    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (die|das|der)     # else, match 'die|das|der'
                     )
                    /xi;

Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
C<(?((?{...}))yes-regexp|no-regexp)>.  In other words, in the case of a
code expression, we don't need the extra parentheses around the
conditional.

If you try to use code expressions with interpolating variables, perl
may surprise you:

    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
    /foo(?{ 1 })$bar/;   # compile error!
    /foo${pat}bar/;      # compile error!

    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
    /foo${pat}bar/;      # compiles ok

If a regexp has (1) code expressions and interpolating variables, or
(2) a variable that interpolates a code expression, perl treats the
regexp as an error.
If the code expression is precompiled into a
variable, however, interpolating is ok.  The question is, why is this
an error?

The reason is that variable interpolation and code expressions
together pose a security risk.  The combination is dangerous because
many programmers who write search engines often take user input and
plug it directly into a regexp:

    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp

If the C<$regexp> variable contains a code expression, the user could
then execute arbitrary Perl code.  For instance, some joker could
search for S<C<system('rm -rf *');> > to erase your files.  In this
sense, the combination of interpolation and code expressions B<taints>
your regexp.  So by default, using both interpolation and code
expressions in the same regexp is not allowed.  If you're not
concerned about malicious users, it is possible to bypass this
security check by invoking S<C<use re 'eval'> >:

    use re 'eval';       # throw caution out the door
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ 1 })$bar/;   # compiles ok
    /foo${pat}bar/;      # compiles ok

Another form of code expression is the S<B<pattern code expression> >.
The pattern code expression is like a regular code expression, except
that the result of the code evaluation is treated as a regular
expression and matched immediately.  A simple example is

    $length = 5;
    $char = 'a';
    $x = 'aaaaabb';
    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'


This final example contains both ordinary and pattern code
expressions.  It detects if a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,...  of the C<1>'s:

    $s0 = 0; $s1 = 1; # initial conditions
    $x = "1101010010001000001";
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
            (
               (??{'0' x $s0}) # match $s0 of '0'
               1               # and then a '1'
               (?{
                  $largest = $s0;   # largest seq so far
                  $s2 = $s1 + $s0;  # compute next term
                  $s0 = $s1;        # in Fibonacci sequence
                  $s1 = $s2;
                 })
            )+   # repeat as needed
            $    # that is all there is
           /x;
    print "Largest sequence matched was $largest\n";

This prints

    It is a Fibonacci sequence
    Largest sequence matched was 5

Ha!  Try that with your garden variety regexp package...
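
As a further, much smaller sketch of a pattern code expression, the
sub below checks a hypothetical length-prefixed format such as
C<3:abc>, where the number before the colon states how many characters
must follow.  The format and the name C<is_length_prefixed> are
assumptions made for this illustration.  The sub relies on C<$1> being
readable inside the code block while the match is in progress, and it
assumes the stated length stays below the engine's limit for C<{n}>
quantifiers.

```perl
use strict;
use warnings;

# Return 1 if $str looks like "N:payload" with a payload of exactly N
# characters, 0 otherwise.  The format and the sub name are made up
# for this illustration.
sub is_length_prefixed {
    my ($str) = @_;
    return $str =~ /^ (\d+) :                  # the declared length N
                      (??{ '.{' . $1 . '}' })  # build and match .{N}
                    \z                         # nothing may follow
                   /xs ? 1 : 0;
}

print is_length_prefixed("3:abc"), "\n";   # 1
print is_length_prefixed("3:ab"),  "\n";   # 0
```

The string C<'.{' . $1 . '}'> is built only when perl reaches that
point in the match, so the sub-pattern that gets matched depends on
what C<(\d+)> captured a moment earlier.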

Note that the variables C<$s0> and C<$s1> are not substituted when the
regexp is compiled, as happens for ordinary variables outside a code
expression.  Rather, the code expressions are evaluated when perl
encounters them during the search for a match.

The regexp without the C<//x> modifier is

    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;

and is a great start on an Obfuscated Perl entry :-) When working with
code and conditional expressions, the extended form of regexps is
almost necessary in creating and debugging regexps.

=head2 Pragmas and debugging

Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl.  We have already encountered one pragma in
the previous section, S<C<use re 'eval';> >, that allows variable
interpolation and code expressions to coexist in a regexp.  The other
pragmas are

    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted

The C<taint> pragma causes any substrings from a match with a tainted
variable to be tainted as well.  This is not normally the case, as
regexps are often used to extract the safe bits from a tainted
variable.
Use C<taint> when you are not extracting safe bits, but are
performing some other processing.  Both C<taint> and C<eval> pragmas
are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.

    use re 'debug';
    /^(.*)$/s;       # output debugging info

    use re 'debugcolor';
    /^(.*)$/s;       # output debugging info in living color

The global C<debug> and C<debugcolor> pragmas allow one to get
detailed debugging info about regexp compilation and
execution.  C<debugcolor> is the same as C<debug>, except the debugging
information is displayed in color on terminals that can display
termcap color sequences.  Here is example output:

    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
    Compiling REx `a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)
    floating `bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx `a*b+c' against `abc'...
    Found floating substr `bc' at offset 1...
    Guessed: match at offset 0
    Matching REx `a*b+c' against `abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: `a*b+c'

If you have gotten this far into the tutorial, you can probably guess
what the different parts of the debugging output tell you.  The first
part

    Compiling REx `a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)

describes the compilation stage.  C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, goto line 4,
i.e., C<PLUS(7)>.  The middle lines describe some heuristics and
optimizations performed before a match:

    floating `bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx `a*b+c' against `abc'...
    Found floating substr `bc' at offset 1...
    Guessed: match at offset 0

Then the match is executed and the remaining lines describe the
process:

    Matching REx `a*b+c' against `abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>             |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>             |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>             |  7:      EXACT <c>
       3 <abc> <>             |  9:      END
    Match successful!
    Freeing REx: `a*b+c'

Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched.  The S<C<< | 1: STAR >> > says that perl is at line number 1
in the compilation list above.  See
L<perldebguts/"Debugging regular expressions"> for much more detail.

An alternative method of debugging regexps is to embed C<print>
statements within the regexp.
This provides a blow-by-blow account of
the backtracking in an alternation:

    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                     t(?{print "t1\n";})
                     h(?{print "h1\n";})
                     i(?{print "i1\n";})
                     s(?{print "s1\n";})
                         |
                     t(?{print "t2\n";})
                     h(?{print "h2\n";})
                     a(?{print "a2\n";})
                     t(?{print "t2\n";})
                     (?{print "Done at position ", pos, "\n";})
                    @x;

prints

    Start at position 0
    t1
    h1
    t2
    h2
    a2
    t2
    Done at position 4

=head1 BUGS

Code expressions, conditional expressions, and independent expressions
are B<experimental>.  Don't use them in production code.  Yet.

=head1 SEE ALSO

This is just a tutorial.  For the full story on perl regular
expressions, see the L<perlre> regular expressions reference page.

For more information on the matching C<m//> and substitution C<s///>
operators, see L<perlop/"Regexp Quote-Like Operators">.  For
information on the C<split> operation, see L<perlfunc/split>.

For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.

The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.

=cut