=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.


=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.
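
As a minimal sketch of the true or false result described above, the
outcome of a match can be stored in a variable and tested later.  The
names C<$hit> and C<$miss> are illustrative only and do not come from
this guide:

```perl
use strict;
use warnings;

# A match in scalar context yields a true value on success and a
# false value on failure; the variable names here are hypothetical.
my $hit  = "Hello World" =~ /World/;   # true
my $miss = "Hello World" =~ /Mars/;    # false

print "hit is true\n"   if $hit;
print "miss is false\n" if !$miss;
```

Storing the result this way is only a convenience; the conditional
forms shown next test the same value directly.
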

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);     # matches
    "cat" =~ /\143\x61\x74/;     # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class.
The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.
Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, perl will try to match the regex at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.
Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...
Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>,
C<\2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match.  The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.
In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop.  If
C<$pattern> won't be changing, use the C<//o> modifier, to only
perform variable substitutions once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    @pattern = ('Seuss');
    m/@pattern/; # matches 'Seuss'
    m'@pattern'; # matches the literal string '@pattern'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.
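
The effect of C<//c> on the position can be sketched as follows.  This
is a small illustration, assuming the same three-word string as in the
example above; the variable names C<$ok>, C<$after_first>,
C<$after_fail>, C<$after_fail_c> and C<$word> are made up for the
sketch:

```perl
use strict;
use warnings;

my $x = "cat dog house";
my $ok;   # hypothetical holder for each match result

$ok = $x =~ /\w+/g;            # matches 'cat'
my $after_first = pos($x);     # position just past 'cat'

$ok = $x =~ /\d+/g;            # no digits, so the match fails
my $after_fail = pos($x);      # position was reset, so it is undefined

$ok = $x =~ /\w+/g;            # starts over and matches 'cat' again
$ok = $x =~ /\d+/gc;           # fails, but //c keeps the position
my $after_fail_c = pos($x);    # unchanged by the failed match

pos($x) = 4;                   # set the position by hand
$ok = $x =~ /(\w+)/g;          # resumes at offset 4, inside 'dog'
my $word = $1;

print "after first match: $after_first\n";
print "after failed //g:  ",
    defined $after_fail ? $after_fail : 'undef', "\n";
print "after failed //gc: $after_fail_c\n";
print "word from offset 4: $word\n";
```

Each match result is assigned to a scalar so that the match is clearly
evaluated in scalar context, which is where C<//g> tracks the position.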

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression.
With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used
C<s'''>, then the regex and replacement are treated as single quoted
strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.
For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718, 3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.  If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, C<split> prepended
an empty initial element to the list.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut