=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;      # matches, delimited by '!'
    "Hello World" =~ m{World};      # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";    # matches after '/usr/bin',
                                    # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                       # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);        # matches
    "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.
To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class.
The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.
Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, perl will try to match the regex at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.

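
The two rules for alternation (the earliest position in the string
wins, and at a given position the leftmost alternative that succeeds
wins) can be checked with a short script.  This is only a sketch: the
variable names are illustrative, and the parentheses are added solely
so that the matched text can be printed.

```perl
use strict;
use warnings;

my $text = "cats and dogs";

# The earliest position wins: 'cat' at offset 0 is found before 'dog',
# even though 'dog' is listed first.
my ($first) = $text =~ /(dog|cat|bird)/;
print "earliest match: $first\n";

# At a single position, alternatives are tried from left to right.
my ($short) = "cats" =~ /(c|ca|cat|cats)/;
my ($long)  = "cats" =~ /(cats|cat|ca|c)/;
print "short alternatives first: $short\n";
print "long alternatives first:  $long\n";
```

Listing longer alternatives before their own prefixes is one way to
make sure the longest one is preferred.
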

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...  Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ...
should only be used outside of a regex, and C<\1>,
C<\2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match.  The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.  In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop.
If
C<$pattern> won't be changing, use the C<//o> modifier to perform
variable substitution only once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    @pattern = ('Seuss');
    m/@pattern/; # matches 'Seuss'
    m'@pattern'; # matches the literal string '@pattern'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression.  With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used
C<s'''>, then the regex and replacement are treated as single quoted
strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.
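
As a quick check of the empty-regex case, here is a minimal sketch
that splits a word into its characters.  The sample string and the
variable name C<@chars> are illustrative only.

```perl
use strict;
use warnings;

# An empty pattern matches between every pair of characters,
# so split returns one element per character.
my @chars = split //, "Perl";

print scalar(@chars), " characters: ", join('-', @chars), "\n";
```
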
If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut