=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl.  It serves as a complement to the
reference page on regular expressions L<perlre>.  Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame.  Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages.  Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression?  A regular expression is simply a string
that describes a pattern.  Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>.  In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand.  Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself.  In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples.  The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts.  If
you master the first part, you will have all the tools needed to solve
about 98% of your needs.  The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools.  It
discusses the more advanced regular expression operators and
introduces the latest cutting-edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex.  Regexp is a more natural abbreviation than regex, but
is harder to pronounce.  The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters.  A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string.  C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match.  In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true.  Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme.  The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing.  When, e.g., C<""> is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in a regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive.  The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >.  The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case.  The lack of a space character is the
reason the third regexp C<'oW'> doesn't match.  The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string.  The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
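
One way to confirm where a match began is the built-in array
C<< @- >>, whose first element C<$-[0]> holds the offset of the start
of the last successful match.  The following is only a minimal sketch;
the variable names C<$text> and C<$offset> are chosen for illustration:

    use strict;
    use warnings;

    my $text   = "That hat is red";
    my $offset = -1;              # -1 stands for "no match" in this sketch
    if ($text =~ /hat/) {
        $offset = $-[0];          # offset where the match started
    }
    # The earliest 'hat' is inside 'That', one character in
    print "first 'hat' starts at offset $offset\n";  # offset 1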

With respect to character matching, there are a few more points you
need to know about.  First of all, not all characters can be used 'as
is' in a match.  Some characters, called B<metacharacters>, are reserved
for use in regexp notation.  The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./;     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./;  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;    # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp.  This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters:

    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>.  Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell.  If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes.  Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2);   # matches
    "1000\n2000" =~ /0\n20/;   # matches
    "1000\t2000" =~ /\000\t2/; # doesn't match, "0" ne "\000"
    "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar.  Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings.  This means that variables can be used in
regexps as well.  Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes.  So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches
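
Because a variable's value is substituted in before matching, any
metacharacters it contains keep their special meaning.  The sketch
below illustrates this and, as one possible remedy not covered in the
text above, uses Perl's built-in C<quotemeta> function to backslash
them.  The variable names are made up for illustration:

    use strict;
    use warnings;

    my $sum = '2+2';                     # contains the metacharacter '+'

    # Interpolated as-is, the regexp is /2+2/, which needs "22"
    my $raw = ("2+2=4" =~ /$sum/) ? 1 : 0;

    # quotemeta backslashes the '+', giving the regexp /2\+2/
    my $safe_pattern = quotemeta($sum);
    my $quoted = ("2+2=4" =~ /$safe_pattern/) ? 1 : 0;

    print "raw=$raw quoted=$quoted\n";   # raw=0 quoted=1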

So far, so good.  With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand.  C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files.  S<C<< while (<>) >> > loops over all the lines in
all the files.  For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line.  In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.
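
The same implicit use of C<$_> can be tried without any input files.
The following sketch replaces the file loop with a C<for> loop over a
small in-memory list; the list C<@words> is a hypothetical stand-in
for the lines of a word file, not the contents of any real dictionary:

    use strict;
    use warnings;

    # Hypothetical stand-in for lines read from a file
    my @words  = ("Babbage\n", "cabbage\n", "dog\n", "sabbath\n");
    my $regexp = "abba";

    my @hits;
    for (@words) {                      # each line is aliased to $_
        push @hits, $_ if /$regexp/;    # match against $_ implicitly
    }
    print @hits;    # the three lines that contain "abba"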

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match.  Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match.  To do
this, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle.  The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string.  Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    ""       =~ /^$/;          # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>.  Since the second regexp is exactly the string, it
matches.  Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't.  Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient.  The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.
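
The two tests are not quite interchangeable, though.  Since C<$> may
also match before a newline at the end of the string, C</^bert$/>
accepts C<"bert\n"> while C<eq> does not.  A small sketch, with
illustrative variable names, that records a 1 or 0 for each test:

    use strict;
    use warnings;

    my @results;
    for my $string ("bert", "bert\n", "bertram") {
        my $by_regexp = ($string =~ /^bert$/) ? 1 : 0;
        my $by_eq     = ($string eq 'bert')   ? 1 : 0;
        push @results, "$by_regexp$by_eq";   # regexp result, then eq result
    }
    # "bert" passes both, "bert\n" only the regexp, "bertram" neither
    print join(" ", @results), "\n";         # 11 10 00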

=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology.  In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>.  A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp.  Character
classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside.  Here are some examples:

    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                         # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match.  Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match.  Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>.  The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation.  We will meet other modifiers later in the tutorial.
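
As a quick check that the two spellings agree, the following sketch
counts how many of a few sample answers each regexp accepts.  The
sample strings and counter names are arbitrary:

    use strict;
    use warnings;

    my ($with_classes, $with_modifier) = (0, 0);
    for my $answer ("yes", "Yes", "YES", "yEs", "no") {
        $with_classes++  if $answer =~ /[yY][eE][sS]/;
        $with_modifier++ if $answer =~ /yes/i;
    }
    # Both forms accept every spelling of 'yes' and reject 'no'
    print "$with_classes $with_modifier\n";   # 4 4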

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves.  The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class.  The special characters for a character class are C<-]\^$>.  C<]>
is special because it denotes the end of a character class.  C<$> is
special because it denotes a scalar variable.  C<\> is special because
it is used in escape sequences, just like above.  Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky.  In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range.  With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>.  Some examples are

    /item[0-9]/;  # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.
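
The equivalence can be checked by collecting, for each class, the
characters it accepts from a small test set.  This is only a sketch;
the true range C<[a-c]> is added for contrast and the variable names
are illustrative:

    use strict;
    use warnings;

    my ($first, $last, $escaped, $range) = ('', '', '', '');
    for my $char ('-', 'a', 'b', 'c') {
        $first   .= $char if $char =~ /[-ab]/;    # '-' in first position
        $last    .= $char if $char =~ /[ab-]/;    # '-' in last position
        $escaped .= $char if $char =~ /[a\-b]/;   # '-' backslashed
        $range   .= $char if $char =~ /[a-c]/;    # '-' as range operator
    }
    print "$first $last $escaped $range\n";   # -ab -ab -ab abc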

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.
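
A brute-force check over the 128 ASCII characters makes the point
concrete.  The sketch below simply counts how many characters each
class accepts; the counter names are illustrative, and the expected
counts assume the plain ASCII definitions of C<\d> and C<\w> given in
the list above:

    use strict;
    use warnings;

    my ($neg_union, $non_word, $union_of_negs) = (0, 0, 0);
    for my $code (0 .. 127) {                # every ASCII character
        my $char = chr($code);
        $neg_union++     if $char =~ /[^\d\w]/;  # not (digit or word char)
        $non_word++      if $char =~ /\W/;       # not a word char
        $union_of_negs++ if $char =~ /[\D\W]/;   # non-digit or non-word
    }
    # 63 of the 128 characters are word characters, 10 of them digits
    print "$neg_union $non_word $union_of_negs\n";   # 65 65 118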

An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>.  This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;    # matches cat in 'Housecat'
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.
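
To see which occurrence of C<cat> each regexp picks, one can record
the starting offset of every match from C<$-[0]>, the first element
of the built-in array C<< @- >>.  A minimal sketch, with the array
name C<@offsets> chosen for illustration:

    use strict;
    use warnings;

    my $x = "Housecat catenates house and cat";
    my @offsets;
    push @offsets, $-[0] if $x =~ /cat/;      # inside 'Housecat'
    push @offsets, $-[0] if $x =~ /\bcat/;    # start of 'catenates'
    push @offsets, $-[0] if $x =~ /cat\b/;    # end of 'Housecat'
    push @offsets, $-[0] if $x =~ /\bcat\b/;  # the final word 'cat'
    print "@offsets\n";                       # 5 9 5 29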

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters.  For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty.  Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    ""   =~ /./;      # doesn't match; it needs a char
    ""   =~ /^.$/;    # doesn't match; it needs a char
    "\n" =~ /^.$/;    # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;    # matches
    "a\n"  =~ /^.$/;  # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line.  Sometimes,
however, we want to keep track of newlines.  We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string.  Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers.  C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines.  The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match.  Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior.  C<'.'> matches any character
except C<"\n">.  C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line.  C<'.'> matches
any character, even C<"\n">.  C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines.  C<'.'>
matches any character except C<"\n">.  C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines.  C<'.'> matches any character, even
C<"\n">.  C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back
488*0Sstevel@tonic-gate
489*0Sstevel@tonic-gateHere are examples of C<//s> and C<//m> in action:
490*0Sstevel@tonic-gate
491*0Sstevel@tonic-gate    $x = "There once was a girl\nWho programmed in Perl\n";
492*0Sstevel@tonic-gate
493*0Sstevel@tonic-gate    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
494*0Sstevel@tonic-gate    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
495*0Sstevel@tonic-gate    $x =~ /^Who/m;  # matches, "Who" at start of second line
496*0Sstevel@tonic-gate    $x =~ /^Who/sm; # matches, "Who" at start of second line
497*0Sstevel@tonic-gate
498*0Sstevel@tonic-gate    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
499*0Sstevel@tonic-gate    $x =~ /girl.Who/s;  # matches, "." matches "\n"
500*0Sstevel@tonic-gate    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
501*0Sstevel@tonic-gate    $x =~ /girl.Who/sm; # matches, "." matches "\n"
502*0Sstevel@tonic-gate
Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful.  If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of the string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):
508*0Sstevel@tonic-gate
509*0Sstevel@tonic-gate    $x =~ /^Who/m;   # matches, "Who" at start of second line
510*0Sstevel@tonic-gate    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string
511*0Sstevel@tonic-gate
512*0Sstevel@tonic-gate    $x =~ /girl$/m;  # matches, "girl" at end of first line
513*0Sstevel@tonic-gate    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
514*0Sstevel@tonic-gate
515*0Sstevel@tonic-gate    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
516*0Sstevel@tonic-gate    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
517*0Sstevel@tonic-gate
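The anchor and modifier rules above can be checked with a short script.  This is only a sketch: the hash C<%matched> and its key names are made up for illustration, and each entry simply records whether the corresponding pattern matched.

```perl
use strict;
use warnings;

my $x = "There once was a girl\nWho programmed in Perl\n";

# Each value is 1 if the pattern matched and 0 otherwise.
my %matched = (
    caret_default => ($x =~ /^Who/      ? 1 : 0),  # ^ only at start of string
    caret_m       => ($x =~ /^Who/m     ? 1 : 0),  # ^ at start of any line
    A_m           => ($x =~ /\AWho/m    ? 1 : 0),  # \A is unaffected by //m
    dot_default   => ($x =~ /girl.Who/  ? 1 : 0),  # . does not match "\n"
    dot_s         => ($x =~ /girl.Who/s ? 1 : 0),  # . matches "\n" under //s
    Z_m           => ($x =~ /Perl\Z/m   ? 1 : 0),  # \Z allows a final newline
    z_m           => ($x =~ /Perl\z/m   ? 1 : 0),  # \z is the very end only
);

printf "%-14s %s\n", $_, ($matched{$_} ? "matches" : "doesn't match")
    for sort keys %matched;
```
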
518*0Sstevel@tonic-gateWe now know how to create choices among classes of characters in a
519*0Sstevel@tonic-gateregexp.  What about choices among words or character strings? Such
520*0Sstevel@tonic-gatechoices are described in the next section.
521*0Sstevel@tonic-gate
522*0Sstevel@tonic-gate=head2 Matching this or that
523*0Sstevel@tonic-gate
Sometimes we would like our regexp to be able to match different
possible words or character strings.  This is accomplished by using
the B<alternation> metacharacter C<|>.  To match C<dog> or C<cat>, we
form the regexp C<dog|cat>.  As before, perl will try to match the
regexp at the earliest possible point in the string.  At each
character position, perl will first try to match the first
alternative, C<dog>.  If C<dog> doesn't match, perl will then try the
next alternative, C<cat>.  If C<cat> doesn't match either, then the
match fails and perl moves to the next position in the string.  Some
examples:
534*0Sstevel@tonic-gate
535*0Sstevel@tonic-gate    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
536*0Sstevel@tonic-gate    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
537*0Sstevel@tonic-gate
538*0Sstevel@tonic-gateEven though C<dog> is the first alternative in the second regexp,
539*0Sstevel@tonic-gateC<cat> is able to match earlier in the string.
540*0Sstevel@tonic-gate
541*0Sstevel@tonic-gate    "cats"          =~ /c|ca|cat|cats/; # matches "c"
542*0Sstevel@tonic-gate    "cats"          =~ /cats|cat|ca|c/; # matches "cats"
543*0Sstevel@tonic-gate
544*0Sstevel@tonic-gateHere, all the alternatives match at the first string position, so the
545*0Sstevel@tonic-gatefirst alternative is the one that matches.  If some of the
546*0Sstevel@tonic-gatealternatives are truncations of the others, put the longest ones first
547*0Sstevel@tonic-gateto give them a chance to match.
548*0Sstevel@tonic-gate
549*0Sstevel@tonic-gate    "cab" =~ /a|b|c/ # matches "c"
550*0Sstevel@tonic-gate                     # /a|b|c/ == /[abc]/
551*0Sstevel@tonic-gate
552*0Sstevel@tonic-gateThe last example points out that character classes are like
553*0Sstevel@tonic-gatealternations of characters.  At a given character position, the first
554*0Sstevel@tonic-gatealternative that allows the regexp match to succeed will be the one
555*0Sstevel@tonic-gatethat matches.
556*0Sstevel@tonic-gate
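To see which alternative actually wins, it helps to print the matched text.  The following sketch uses a hypothetical helper, C<matched_text>, which is not part of Perl; it recovers the match with C<substr> and the C<@-> and C<@+> arrays.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re in $string,
# or undef if the regexp does not match.
sub matched_text {
    my ($string, $re) = @_;
    return $string =~ $re ? substr($string, $-[0], $+[0] - $-[0]) : undef;
}

# The earliest position wins, even though 'dog' is listed first.
my $first = matched_text("cats and dogs", qr/dog|cat|bird/);

# At the same position, the leftmost alternative that matches wins.
my $short = matched_text("cats", qr/c|ca|cat|cats/);
my $long  = matched_text("cats", qr/cats|cat|ca|c/);

print "$first $short $long\n";   # prints "cat c cats"
```
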
557*0Sstevel@tonic-gate=head2 Grouping things and hierarchical matching
558*0Sstevel@tonic-gate
Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying.  The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp.  For instance, suppose we want to search for housecats or
housekeepers.  The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice.  It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.
567*0Sstevel@tonic-gate
The B<grouping> metacharacters C<()> solve this problem.  Grouping
allows parts of a regexp to be treated as a single unit.  Parts of a
regexp are grouped by enclosing them in parentheses.  Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>.  The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are
575*0Sstevel@tonic-gate
576*0Sstevel@tonic-gate    /(a|b)b/;    # matches 'ab' or 'bb'
577*0Sstevel@tonic-gate    /(ac|b)b/;   # matches 'acb' or 'bb'
578*0Sstevel@tonic-gate    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
579*0Sstevel@tonic-gate    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'
580*0Sstevel@tonic-gate
581*0Sstevel@tonic-gate    /house(cat|)/;  # matches either 'housecat' or 'house'
582*0Sstevel@tonic-gate    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
583*0Sstevel@tonic-gate                        # 'house'.  Note groups can be nested.
584*0Sstevel@tonic-gate
585*0Sstevel@tonic-gate    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
586*0Sstevel@tonic-gate    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
587*0Sstevel@tonic-gate                             # because '20\d\d' can't match
588*0Sstevel@tonic-gate
589*0Sstevel@tonic-gateAlternations behave the same way in groups as out of them: at a given
590*0Sstevel@tonic-gatestring position, the leftmost alternative that allows the regexp to
591*0Sstevel@tonic-gatematch is taken.  So in the last example at the first string position,
592*0Sstevel@tonic-gateC<"20"> matches the second alternative, but there is nothing left over
593*0Sstevel@tonic-gateto match the next two digits C<\d\d>.  So perl moves on to the next
594*0Sstevel@tonic-gatealternative, which is the null alternative and that works, since
595*0Sstevel@tonic-gateC<"20"> is two digits.
596*0Sstevel@tonic-gate
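A small sketch of grouping in practice follows.  The sample words and years are invented for illustration, and the C<^> and C<$> anchors are added here so that each test string must match in full.

```perl
use strict;
use warnings;

# Keep only the words that are 'house' followed by 'cat' or 'keeper'.
my @found = grep { /^house(cat|keeper)$/ }
            qw(housecat housekeeper housework);

# The null alternative lets the two-digit prefix be optional, so both
# four-digit years starting with 19 or 20 and bare two-digit years match.
my @years = grep { /^(19|20|)\d\d$/ } qw(1999 2024 20 99 1850 7);

print "@found\n";   # prints "housecat housekeeper"
print "@years\n";   # prints "1999 2024 20 99"
```
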
597*0Sstevel@tonic-gateThe process of trying one alternative, seeing if it matches, and
598*0Sstevel@tonic-gatemoving on to the next alternative if it doesn't, is called
599*0Sstevel@tonic-gateB<backtracking>.  The term 'backtracking' comes from the idea that
600*0Sstevel@tonic-gatematching a regexp is like a walk in the woods.  Successfully matching
601*0Sstevel@tonic-gatea regexp is like arriving at a destination.  There are many possible
602*0Sstevel@tonic-gatetrailheads, one for each string position, and each one is tried in
603*0Sstevel@tonic-gateorder, left to right.  From each trailhead there may be many paths,
604*0Sstevel@tonic-gatesome of which get you there, and some which are dead ends.  When you
605*0Sstevel@tonic-gatewalk along a trail and hit a dead end, you have to backtrack along the
606*0Sstevel@tonic-gatetrail to an earlier point to try another trail.  If you hit your
607*0Sstevel@tonic-gatedestination, you stop immediately and forget about trying all the
608*0Sstevel@tonic-gateother trails.  You are persistent, and only if you have tried all the
609*0Sstevel@tonic-gatetrails from all the trailheads and not arrived at your destination, do
610*0Sstevel@tonic-gateyou declare failure.  To be concrete, here is a step-by-step analysis
611*0Sstevel@tonic-gateof what perl does when it tries to match the regexp
612*0Sstevel@tonic-gate
613*0Sstevel@tonic-gate    "abcde" =~ /(abd|abc)(df|d|de)/;
614*0Sstevel@tonic-gate
615*0Sstevel@tonic-gate=over 4
616*0Sstevel@tonic-gate
617*0Sstevel@tonic-gate=item 0
618*0Sstevel@tonic-gate
619*0Sstevel@tonic-gateStart with the first letter in the string 'a'.
620*0Sstevel@tonic-gate
621*0Sstevel@tonic-gate=item 1
622*0Sstevel@tonic-gate
623*0Sstevel@tonic-gateTry the first alternative in the first group 'abd'.
624*0Sstevel@tonic-gate
625*0Sstevel@tonic-gate=item 2
626*0Sstevel@tonic-gate
627*0Sstevel@tonic-gateMatch 'a' followed by 'b'. So far so good.
628*0Sstevel@tonic-gate
629*0Sstevel@tonic-gate=item 3
630*0Sstevel@tonic-gate
631*0Sstevel@tonic-gate'd' in the regexp doesn't match 'c' in the string - a dead
632*0Sstevel@tonic-gateend.  So backtrack two characters and pick the second alternative in
633*0Sstevel@tonic-gatethe first group 'abc'.
634*0Sstevel@tonic-gate
635*0Sstevel@tonic-gate=item 4
636*0Sstevel@tonic-gate
637*0Sstevel@tonic-gateMatch 'a' followed by 'b' followed by 'c'.  We are on a roll
638*0Sstevel@tonic-gateand have satisfied the first group. Set $1 to 'abc'.
639*0Sstevel@tonic-gate
640*0Sstevel@tonic-gate=item 5
641*0Sstevel@tonic-gate
642*0Sstevel@tonic-gateMove on to the second group and pick the first alternative
643*0Sstevel@tonic-gate'df'.
644*0Sstevel@tonic-gate
645*0Sstevel@tonic-gate=item 6
646*0Sstevel@tonic-gate
647*0Sstevel@tonic-gateMatch the 'd'.
648*0Sstevel@tonic-gate
649*0Sstevel@tonic-gate=item 7
650*0Sstevel@tonic-gate
651*0Sstevel@tonic-gate'f' in the regexp doesn't match 'e' in the string, so a dead
652*0Sstevel@tonic-gateend.  Backtrack one character and pick the second alternative in the
653*0Sstevel@tonic-gatesecond group 'd'.
654*0Sstevel@tonic-gate
655*0Sstevel@tonic-gate=item 8
656*0Sstevel@tonic-gate
657*0Sstevel@tonic-gate'd' matches. The second grouping is satisfied, so set $2 to
658*0Sstevel@tonic-gate'd'.
659*0Sstevel@tonic-gate
660*0Sstevel@tonic-gate=item 9
661*0Sstevel@tonic-gate
662*0Sstevel@tonic-gateWe are at the end of the regexp, so we are done! We have
663*0Sstevel@tonic-gatematched 'abcd' out of the string "abcde".
664*0Sstevel@tonic-gate
665*0Sstevel@tonic-gate=back
666*0Sstevel@tonic-gate
667*0Sstevel@tonic-gateThere are a couple of things to note about this analysis.  First, the
668*0Sstevel@tonic-gatethird alternative in the second group 'de' also allows a match, but we
669*0Sstevel@tonic-gatestopped before we got to it - at a given character position, leftmost
670*0Sstevel@tonic-gatewins.  Second, we were able to get a match at the first character
671*0Sstevel@tonic-gateposition of the string 'a'.  If there were no matches at the first
672*0Sstevel@tonic-gateposition, perl would move to the second character position 'b' and
673*0Sstevel@tonic-gateattempt the match all over again.  Only when all possible paths at all
674*0Sstevel@tonic-gatepossible character positions have been exhausted does perl give
675*0Sstevel@tonic-gateup and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.
676*0Sstevel@tonic-gate
Even with all this work, regexp matching happens remarkably fast.  To
speed things up, during the compilation stage, perl compiles the regexp
into a compact sequence of opcodes that can often fit inside a
processor cache.  When the code is executed, these opcodes can then run
at full throttle and search very quickly.
682*0Sstevel@tonic-gate
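The outcome of the step-by-step backtracking analysis above can be confirmed directly.  This sketch only inspects the final result; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "abcde";
my ($first, $second, $whole);

if ($string =~ /(abd|abc)(df|d|de)/) {
    ($first, $second) = ($1, $2);
    # Recover the whole match from the offsets in @- and @+.
    $whole = substr($string, $-[0], $+[0] - $-[0]);
}

# prints: $1 = 'abc', $2 = 'd', matched 'abcd'
print "\$1 = '$first', \$2 = '$second', matched '$whole'\n";
```

To watch the individual steps rather than just the result, the C<re> pragma can be loaded with C<use re 'debug'>, which makes perl print a trace of the matching process.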
683*0Sstevel@tonic-gate=head2 Extracting matches
684*0Sstevel@tonic-gate
685*0Sstevel@tonic-gateThe grouping metacharacters C<()> also serve another completely
686*0Sstevel@tonic-gatedifferent function: they allow the extraction of the parts of a string
687*0Sstevel@tonic-gatethat matched.  This is very useful to find out what matched and for
688*0Sstevel@tonic-gatetext processing in general.  For each grouping, the part that matched
689*0Sstevel@tonic-gateinside goes into the special variables C<$1>, C<$2>, etc.  They can be
690*0Sstevel@tonic-gateused just as ordinary variables:
691*0Sstevel@tonic-gate
692*0Sstevel@tonic-gate    # extract hours, minutes, seconds
693*0Sstevel@tonic-gate    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
694*0Sstevel@tonic-gate	$hours = $1;
695*0Sstevel@tonic-gate	$minutes = $2;
696*0Sstevel@tonic-gate	$seconds = $3;
697*0Sstevel@tonic-gate    }
698*0Sstevel@tonic-gate
699*0Sstevel@tonic-gateNow, we know that in scalar context,
700*0Sstevel@tonic-gateS<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
701*0Sstevel@tonic-gatevalue.  In list context, however, it returns the list of matched values
702*0Sstevel@tonic-gateC<($1,$2,$3)>.  So we could write the code more compactly as
703*0Sstevel@tonic-gate
    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
706*0Sstevel@tonic-gate
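Here is a self-contained sketch of the list-context form, with an assumed sample value for C<$time>.  It also shows that a failed match in list context returns the empty list, which is convenient for testing whether anything was extracted.

```perl
use strict;
use warnings;

my $time = "12:34:56";   # sample input, chosen for illustration

# In list context the match returns ($1, $2, $3).
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
print "$hours hours, $minutes minutes, $seconds seconds\n";

# When the match fails, the list is empty.
my @parts = "noon" =~ /(\d\d):(\d\d):(\d\d)/;
my $count = @parts;
print "extracted $count fields from 'noon'\n";
```
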
707*0Sstevel@tonic-gateIf the groupings in a regexp are nested, C<$1> gets the group with the
708*0Sstevel@tonic-gateleftmost opening parenthesis, C<$2> the next opening parenthesis,
709*0Sstevel@tonic-gateetc.  For example, here is a complex regexp and the matching variables
710*0Sstevel@tonic-gateindicated below it:
711*0Sstevel@tonic-gate
712*0Sstevel@tonic-gate    /(ab(cd|ef)((gi)|j))/;
713*0Sstevel@tonic-gate     1  2      34
714*0Sstevel@tonic-gate
715*0Sstevel@tonic-gateso that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
716*0Sstevel@tonic-gateconvenience, perl sets C<$+> to the string held by the highest numbered
717*0Sstevel@tonic-gateC<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the
718*0Sstevel@tonic-gatevalue of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>,
719*0Sstevel@tonic-gateC<$2>, ... associated with the rightmost closing parenthesis used in the
720*0Sstevel@tonic-gatematch).
721*0Sstevel@tonic-gate
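The numbering of nested groups can be explored with the regexp above and two sample strings, which are assumptions made for this sketch.  Groups that do not take part in the match are returned as C<undef> in list context.

```perl
use strict;
use warnings;

# All four groups take part in this match.
my @groups = "abefgi" =~ /(ab(cd|ef)((gi)|j))/;
my ($last_paren, $last_closed) = ($+, $^N);

# Here group 4, (gi), does not take part, so $4 is undefined.
my @alt = "abcdj" =~ /(ab(cd|ef)((gi)|j))/;

print join(", ", map { "'$_'" } @groups), "\n";
print "\$+ is '$last_paren', \$^N is '$last_closed'\n";
print join(", ", map { defined $_ ? "'$_'" : "undef" } @alt), "\n";
```

In the first match C<$^N> holds the same text as C<$1>, because group 1 has the rightmost closing parenthesis.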
722*0Sstevel@tonic-gateClosely associated with the matching variables C<$1>, C<$2>, ... are
723*0Sstevel@tonic-gatethe B<backreferences> C<\1>, C<\2>, ... .  Backreferences are simply
724*0Sstevel@tonic-gatematching variables that can be used I<inside> a regexp.  This is a
725*0Sstevel@tonic-gatereally nice feature - what matches later in a regexp can depend on
726*0Sstevel@tonic-gatewhat matched earlier in the regexp.  Suppose we wanted to look
727*0Sstevel@tonic-gatefor doubled words in text, like 'the the'.  The following regexp finds
728*0Sstevel@tonic-gateall 3-letter doubles with a space in between:
729*0Sstevel@tonic-gate
730*0Sstevel@tonic-gate    /(\w\w\w)\s\1/;
731*0Sstevel@tonic-gate
The grouping assigns a value to C<\1>, so that the same 3-letter sequence
is used for both parts.  Here are some words with repeated parts:
734*0Sstevel@tonic-gate
735*0Sstevel@tonic-gate    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
736*0Sstevel@tonic-gate    beriberi
737*0Sstevel@tonic-gate    booboo
738*0Sstevel@tonic-gate    coco
739*0Sstevel@tonic-gate    mama
740*0Sstevel@tonic-gate    murmur
741*0Sstevel@tonic-gate    papa
742*0Sstevel@tonic-gate
The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc., and uses C<\1> to look for
a repeat.  Although C<$1> and C<\1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
so may lead to surprising and/or undefined results.
749*0Sstevel@tonic-gate
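Both backreference patterns above can be tried without a dictionary file.  In this sketch the sample sentence and the short word list stand in for real input and are purely illustrative.

```perl
use strict;
use warnings;

# Find a doubled 3-letter word; $1 holds the repeated word.
my $text = "Paris in the the spring";
my $doubled;
$doubled = $1 if $text =~ /(\w\w\w)\s\1/;
print "doubled word: $doubled\n";

# Keep only words made of one part repeated twice.
my @words   = qw(beriberi booboo coco mama murmur papa banana perl);
my @repeats = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\1$/ } @words;
print "@repeats\n";
```
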
750*0Sstevel@tonic-gateIn addition to what was matched, Perl 5.6.0 also provides the
751*0Sstevel@tonic-gatepositions of what was matched with the C<@-> and C<@+>
752*0Sstevel@tonic-gatearrays. C<$-[0]> is the position of the start of the entire match and
753*0Sstevel@tonic-gateC<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
754*0Sstevel@tonic-gateposition of the start of the C<$n> match and C<$+[n]> is the position
755*0Sstevel@tonic-gateof the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
756*0Sstevel@tonic-gatethis code
757*0Sstevel@tonic-gate
758*0Sstevel@tonic-gate    $x = "Mmm...donut, thought Homer";
759*0Sstevel@tonic-gate    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
760*0Sstevel@tonic-gate    foreach $expr (1..$#-) {
761*0Sstevel@tonic-gate        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
762*0Sstevel@tonic-gate    }
763*0Sstevel@tonic-gate
764*0Sstevel@tonic-gateprints
765*0Sstevel@tonic-gate
766*0Sstevel@tonic-gate    Match 1: 'Mmm' at position (0,3)
767*0Sstevel@tonic-gate    Match 2: 'donut' at position (6,11)
768*0Sstevel@tonic-gate
Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string.  If you use the variables
C<$`>, C<$&>, and C<$'>, perl will set C<$`> to the part of the string
before the match, C<$&> to the part of the string that matched, and
C<$'> to the part of the string after the match.  An example:
774*0Sstevel@tonic-gate
775*0Sstevel@tonic-gate    $x = "the cat caught the mouse";
776*0Sstevel@tonic-gate    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
777*0Sstevel@tonic-gate    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'
778*0Sstevel@tonic-gate
In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'.  It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, and C< $& > slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program.  So if raw
performance is a goal of your application, they should be avoided.
If you need the same information, use C<@-> and C<@+> instead:
787*0Sstevel@tonic-gate
788*0Sstevel@tonic-gate    $` is the same as substr( $x, 0, $-[0] )
789*0Sstevel@tonic-gate    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
790*0Sstevel@tonic-gate    $' is the same as substr( $x, $+[0] )
791*0Sstevel@tonic-gate
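The three equivalences above translate into code as follows.  This is a sketch in which the variable names C<$pre>, C<$match> and C<$post> are chosen for illustration.

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($pre, $match, $post);

if ($x =~ /cat/) {
    $pre   = substr($x, 0, $-[0]);               # same text as $`
    $match = substr($x, $-[0], $+[0] - $-[0]);   # same text as $&
    $post  = substr($x, $+[0]);                  # same text as $'
}

print "before: '$pre', matched: '$match', after: '$post'\n";
```
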
792*0Sstevel@tonic-gate=head2 Matching repetitions
793*0Sstevel@tonic-gate
794*0Sstevel@tonic-gateThe examples in the previous section display an annoying weakness.  We
795*0Sstevel@tonic-gatewere only matching 3-letter words, or syllables of 4 letters or
796*0Sstevel@tonic-gateless.  We'd like to be able to match words or syllables of any length,
797*0Sstevel@tonic-gatewithout writing out tedious alternatives like
798*0Sstevel@tonic-gateC<\w\w\w\w|\w\w\w|\w\w|\w>.
799*0Sstevel@tonic-gate
800*0Sstevel@tonic-gateThis is exactly the problem the B<quantifier> metacharacters C<?>,
801*0Sstevel@tonic-gateC<*>, C<+>, and C<{}> were created for.  They allow us to determine the
802*0Sstevel@tonic-gatenumber of repeats of a portion of a regexp we consider to be a
803*0Sstevel@tonic-gatematch.  Quantifiers are put immediately after the character, character
804*0Sstevel@tonic-gateclass, or grouping that we want to specify.  They have the following
805*0Sstevel@tonic-gatemeanings:
806*0Sstevel@tonic-gate
807*0Sstevel@tonic-gate=over 4
808*0Sstevel@tonic-gate
809*0Sstevel@tonic-gate=item *
810*0Sstevel@tonic-gate
811*0Sstevel@tonic-gateC<a?> = match 'a' 1 or 0 times
812*0Sstevel@tonic-gate
813*0Sstevel@tonic-gate=item *
814*0Sstevel@tonic-gate
815*0Sstevel@tonic-gateC<a*> = match 'a' 0 or more times, i.e., any number of times
816*0Sstevel@tonic-gate
817*0Sstevel@tonic-gate=item *
818*0Sstevel@tonic-gate
819*0Sstevel@tonic-gateC<a+> = match 'a' 1 or more times, i.e., at least once
820*0Sstevel@tonic-gate
821*0Sstevel@tonic-gate=item *
822*0Sstevel@tonic-gate
823*0Sstevel@tonic-gateC<a{n,m}> = match at least C<n> times, but not more than C<m>
824*0Sstevel@tonic-gatetimes.
825*0Sstevel@tonic-gate
826*0Sstevel@tonic-gate=item *
827*0Sstevel@tonic-gate
C<a{n,}> = match at least C<n> times
829*0Sstevel@tonic-gate
830*0Sstevel@tonic-gate=item *
831*0Sstevel@tonic-gate
832*0Sstevel@tonic-gateC<a{n}> = match exactly C<n> times
833*0Sstevel@tonic-gate
834*0Sstevel@tonic-gate=back
835*0Sstevel@tonic-gate
836*0Sstevel@tonic-gateHere are some examples:
837*0Sstevel@tonic-gate
838*0Sstevel@tonic-gate    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
839*0Sstevel@tonic-gate                     # any number of digits
840*0Sstevel@tonic-gate    /(\w+)\s+\1/;    # match doubled words of arbitrary length
841*0Sstevel@tonic-gate    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
842*0Sstevel@tonic-gate    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
843*0Sstevel@tonic-gate                         # than 4 digits
844*0Sstevel@tonic-gate    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
845*0Sstevel@tonic-gate    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
846*0Sstevel@tonic-gate                               # this produces $1 and the other does not.
847*0Sstevel@tonic-gate
848*0Sstevel@tonic-gate    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
849*0Sstevel@tonic-gate    beriberi
850*0Sstevel@tonic-gate    booboo
851*0Sstevel@tonic-gate    coco
852*0Sstevel@tonic-gate    mama
853*0Sstevel@tonic-gate    murmur
854*0Sstevel@tonic-gate    papa
855*0Sstevel@tonic-gate
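The year patterns above are unanchored, so they succeed on any string that merely contains enough digits.  The sketch below adds C<^> and C<$> anchors, which is an addition for illustration, so that the whole string must be a year, and then compares the two quantifier forms on a few sample values.

```perl
use strict;
use warnings;

my @candidates = qw(99 199 1999 19999);

# {2,4} accepts two, three, or four digits.
my @loose = grep { /^\d{2,4}$/ } @candidates;

# The alternation accepts exactly four or exactly two digits.
my @exact = grep { /^(\d{4}|\d{2})$/ } @candidates;

print "loose: @loose\n";   # prints "loose: 99 199 1999"
print "exact: @exact\n";   # prints "exact: 99 1999"
```
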
856*0Sstevel@tonic-gateFor all of these quantifiers, perl will try to match as much of the
857*0Sstevel@tonic-gatestring as possible, while still allowing the regexp to succeed.  Thus
858*0Sstevel@tonic-gatewith C</a?.../>, perl will first try to match the regexp with the C<a>
859*0Sstevel@tonic-gatepresent; if that fails, perl will try to match the regexp without the
860*0Sstevel@tonic-gateC<a> present.  For the quantifier C<*>, we get the following:
861*0Sstevel@tonic-gate
862*0Sstevel@tonic-gate    $x = "the cat in the hat";
863*0Sstevel@tonic-gate    $x =~ /^(.*)(cat)(.*)$/; # matches,
864*0Sstevel@tonic-gate                             # $1 = 'the '
865*0Sstevel@tonic-gate                             # $2 = 'cat'
866*0Sstevel@tonic-gate                             # $3 = ' in the hat'
867*0Sstevel@tonic-gate
This is what we might expect: the match finds the only C<cat> in the
string and locks onto it.  Consider, however, this regexp:
870*0Sstevel@tonic-gate
871*0Sstevel@tonic-gate    $x =~ /^(.*)(at)(.*)$/; # matches,
872*0Sstevel@tonic-gate                            # $1 = 'the cat in the h'
873*0Sstevel@tonic-gate                            # $2 = 'at'
874*0Sstevel@tonic-gate                            # $3 = ''   (0 matches)
875*0Sstevel@tonic-gate
One might initially guess that perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>.  Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match.  In
this example, that means matching the C<at> sequence against the final
C<at> in the string.  The other important principle illustrated here is
that when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps.  Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string.  Quantifiers that
grab as much of the string as possible are called B<maximal match> or
B<greedy> quantifiers.
889*0Sstevel@tonic-gate
890*0Sstevel@tonic-gateWhen a regexp can match a string in several different ways, we can use
891*0Sstevel@tonic-gatethe principles above to predict which way the regexp will match:
892*0Sstevel@tonic-gate
893*0Sstevel@tonic-gate=over 4
894*0Sstevel@tonic-gate
895*0Sstevel@tonic-gate=item *
896*0Sstevel@tonic-gate
897*0Sstevel@tonic-gatePrinciple 0: Taken as a whole, any regexp will be matched at the
898*0Sstevel@tonic-gateearliest possible position in the string.
899*0Sstevel@tonic-gate
900*0Sstevel@tonic-gate=item *
901*0Sstevel@tonic-gate
902*0Sstevel@tonic-gatePrinciple 1: In an alternation C<a|b|c...>, the leftmost alternative
903*0Sstevel@tonic-gatethat allows a match for the whole regexp will be the one used.
904*0Sstevel@tonic-gate
905*0Sstevel@tonic-gate=item *
906*0Sstevel@tonic-gate
907*0Sstevel@tonic-gatePrinciple 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
908*0Sstevel@tonic-gateC<{n,m}> will in general match as much of the string as possible while
909*0Sstevel@tonic-gatestill allowing the whole regexp to match.
910*0Sstevel@tonic-gate
911*0Sstevel@tonic-gate=item *
912*0Sstevel@tonic-gate
913*0Sstevel@tonic-gatePrinciple 3: If there are two or more elements in a regexp, the
914*0Sstevel@tonic-gateleftmost greedy quantifier, if any, will match as much of the string
915*0Sstevel@tonic-gateas possible while still allowing the whole regexp to match.  The next
916*0Sstevel@tonic-gateleftmost greedy quantifier, if any, will try to match as much of the
917*0Sstevel@tonic-gatestring remaining available to it as possible, while still allowing the
918*0Sstevel@tonic-gatewhole regexp to match.  And so on, until all the regexp elements are
919*0Sstevel@tonic-gatesatisfied.
920*0Sstevel@tonic-gate
921*0Sstevel@tonic-gate=back
922*0Sstevel@tonic-gate
923*0Sstevel@tonic-gateAs we have seen above, Principle 0 overrides the others - the regexp
924*0Sstevel@tonic-gatewill be matched as early as possible, with the other principles
925*0Sstevel@tonic-gatedetermining how the regexp matches at that earliest character
926*0Sstevel@tonic-gateposition.
927*0Sstevel@tonic-gate
928*0Sstevel@tonic-gateHere is an example of these principles in action:
929*0Sstevel@tonic-gate
930*0Sstevel@tonic-gate    $x = "The programming republic of Perl";
931*0Sstevel@tonic-gate    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
932*0Sstevel@tonic-gate                              # $1 = 'The programming republic of Pe'
933*0Sstevel@tonic-gate                              # $2 = 'r'
934*0Sstevel@tonic-gate                              # $3 = 'l'
935*0Sstevel@tonic-gate
936*0Sstevel@tonic-gateThis regexp matches at the earliest string position, C<'T'>.  One
937*0Sstevel@tonic-gatemight think that C<e>, being leftmost in the alternation, would be
938*0Sstevel@tonic-gatematched, but C<r> produces the longest string in the first quantifier.
939*0Sstevel@tonic-gate
940*0Sstevel@tonic-gate    $x =~ /(m{1,2})(.*)$/;  # matches,
941*0Sstevel@tonic-gate                            # $1 = 'mm'
942*0Sstevel@tonic-gate                            # $2 = 'ing republic of Perl'
943*0Sstevel@tonic-gate
Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.
947*0Sstevel@tonic-gate
948*0Sstevel@tonic-gate    $x =~ /.*(m{1,2})(.*)$/;  # matches,
949*0Sstevel@tonic-gate                              # $1 = 'm'
950*0Sstevel@tonic-gate                              # $2 = 'ing republic of Perl'
951*0Sstevel@tonic-gate
952*0Sstevel@tonic-gateHere, the regexp matches at the start of the string. The first
953*0Sstevel@tonic-gatequantifier C<.*> grabs as much as possible, leaving just a single
954*0Sstevel@tonic-gateC<'m'> for the second quantifier C<m{1,2}>.
955*0Sstevel@tonic-gate
956*0Sstevel@tonic-gate    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
957*0Sstevel@tonic-gate                                # $1 = 'a'
958*0Sstevel@tonic-gate                                # $2 = 'mm'
959*0Sstevel@tonic-gate                                # $3 = 'ing republic of Perl'
960*0Sstevel@tonic-gate
961*0Sstevel@tonic-gateHere, C<.?> eats its maximal one character at the earliest possible
962*0Sstevel@tonic-gateposition in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
963*0Sstevel@tonic-gatethe opportunity to match both C<m>'s. Finally,
964*0Sstevel@tonic-gate
965*0Sstevel@tonic-gate    "aXXXb" =~ /(X*)/; # matches with $1 = ''
966*0Sstevel@tonic-gate
967*0Sstevel@tonic-gatebecause it can match zero copies of C<'X'> at the beginning of the
968*0Sstevel@tonic-gatestring.  If you definitely want to match at least one C<'X'>, use
969*0Sstevel@tonic-gateC<X+>, not C<X*>.
970*0Sstevel@tonic-gate
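The difference between C<X*> and C<X+> shows up clearly when the match position is printed along with the captured text.  A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

# X* is satisfied by zero X's, so it matches the empty string at position 0.
"aXXXb" =~ /(X*)/ or die "unexpected: no match";
my ($star, $star_at) = ($1, $-[0]);

# X+ needs at least one X, so the match starts at position 1.
"aXXXb" =~ /(X+)/ or die "unexpected: no match";
my ($plus, $plus_at) = ($1, $-[0]);

print "X*: '$star' at $star_at; X+: '$plus' at $plus_at\n";
```
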
Sometimes greed is not good.  At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece.  For
this purpose, Larry Wall created the S<B<minimal match> > or
B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>.  These are
the usual quantifiers with a C<?> appended to them.  They have the
following meanings:
977*0Sstevel@tonic-gate
978*0Sstevel@tonic-gate=over 4
979*0Sstevel@tonic-gate
980*0Sstevel@tonic-gate=item *
981*0Sstevel@tonic-gate
982*0Sstevel@tonic-gateC<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
983*0Sstevel@tonic-gate
984*0Sstevel@tonic-gate=item *
985*0Sstevel@tonic-gate
986*0Sstevel@tonic-gateC<a*?> = match 'a' 0 or more times, i.e., any number of times,
987*0Sstevel@tonic-gatebut as few times as possible
988*0Sstevel@tonic-gate
989*0Sstevel@tonic-gate=item *
990*0Sstevel@tonic-gate
991*0Sstevel@tonic-gateC<a+?> = match 'a' 1 or more times, i.e., at least once, but
992*0Sstevel@tonic-gateas few times as possible
993*0Sstevel@tonic-gate
994*0Sstevel@tonic-gate=item *
995*0Sstevel@tonic-gate
996*0Sstevel@tonic-gateC<a{n,m}?> = match at least C<n> times, not more than C<m>
997*0Sstevel@tonic-gatetimes, as few times as possible
998*0Sstevel@tonic-gate
999*0Sstevel@tonic-gate=item *
1000*0Sstevel@tonic-gate
1001*0Sstevel@tonic-gateC<a{n,}?> = match at least C<n> times, but as few times as
1002*0Sstevel@tonic-gatepossible
1003*0Sstevel@tonic-gate
1004*0Sstevel@tonic-gate=item *
1005*0Sstevel@tonic-gate
1006*0Sstevel@tonic-gateC<a{n}?> = match exactly C<n> times.  Because we match exactly
1007*0Sstevel@tonic-gateC<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
1008*0Sstevel@tonic-gatenotational consistency.
1009*0Sstevel@tonic-gate
1010*0Sstevel@tonic-gate=back
1011*0Sstevel@tonic-gate
1012*0Sstevel@tonic-gateLet's look at the example above, but with minimal quantifiers:
1013*0Sstevel@tonic-gate
1014*0Sstevel@tonic-gate    $x = "The programming republic of Perl";
1015*0Sstevel@tonic-gate    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
1016*0Sstevel@tonic-gate                              # $1 = 'Th'
1017*0Sstevel@tonic-gate                              # $2 = 'e'
1018*0Sstevel@tonic-gate                              # $3 = ' programming republic of Perl'
1019*0Sstevel@tonic-gate
1020*0Sstevel@tonic-gateThe minimal string that will allow both the start of the string C<^>
1021*0Sstevel@tonic-gateand the alternation to match is C<Th>, with the alternation C<e|r>
1022*0Sstevel@tonic-gatematching C<e>.  The second quantifier C<.*> is free to gobble up the
1023*0Sstevel@tonic-gaterest of the string.
1024*0Sstevel@tonic-gate
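A common place where the choice between greedy and minimal quantifiers matters is extracting delimited text.  The following sketch uses a made-up sample line with two quoted strings and compares C<.*> with C<.*?> between the quotes.

```perl
use strict;
use warnings;

my $line = 'say "hello" and "goodbye"';

# Greedy: .* runs to the last quote that still lets the regexp match.
my ($greedy) = $line =~ /"(.*)"/;

# Minimal: .*? stops at the first closing quote it can.
my ($minimal) = $line =~ /"(.*?)"/;

print "greedy:  $greedy\n";    # prints: greedy:  hello" and "goodbye
print "minimal: $minimal\n";   # prints: minimal: hello
```
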
1025*0Sstevel@tonic-gate    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
1026*0Sstevel@tonic-gate                              # $1 = 'm'
1027*0Sstevel@tonic-gate                              # $2 = 'ming republic of Perl'
1028*0Sstevel@tonic-gate
1029*0Sstevel@tonic-gateThe first string position that this regexp can match is at the first
1030*0Sstevel@tonic-gateC<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
1031*0Sstevel@tonic-gatematches just one C<'m'>.  Although the second quantifier C<.*?> would
1032*0Sstevel@tonic-gateprefer to match no characters, it is constrained by the end-of-string
1033*0Sstevel@tonic-gateanchor C<$> to match the rest of the string.
1034*0Sstevel@tonic-gate
1035*0Sstevel@tonic-gate    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
1036*0Sstevel@tonic-gate                                  # $1 = 'The progra'
1037*0Sstevel@tonic-gate                                  # $2 = 'm'
1038*0Sstevel@tonic-gate                                  # $3 = 'ming republic of Perl'
1039*0Sstevel@tonic-gate
1040*0Sstevel@tonic-gateIn this regexp, you might expect the first minimal quantifier C<.*?>
1041*0Sstevel@tonic-gateto match the empty string, because it is not constrained by a C<^>
1042*0Sstevel@tonic-gateanchor to match the beginning of the word.  Principle 0 applies here,
1043*0Sstevel@tonic-gatehowever.  Because it is possible for the whole regexp to match at the
1044*0Sstevel@tonic-gatestart of the string, it I<will> match at the start of the string.  Thus
1045*0Sstevel@tonic-gatethe first quantifier has to match everything up to the first C<m>.  The
1046*0Sstevel@tonic-gatesecond minimal quantifier matches just one C<m> and the third
1047*0Sstevel@tonic-gatequantifier matches the rest of the string.
1048*0Sstevel@tonic-gate
1049*0Sstevel@tonic-gate    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
1050*0Sstevel@tonic-gate                                 # $1 = 'a'
1051*0Sstevel@tonic-gate                                 # $2 = 'mm'
1052*0Sstevel@tonic-gate                                 # $3 = 'ing republic of Perl'
1053*0Sstevel@tonic-gate
Just as in the previous regexp, Principle 0 decides where the match
starts.  The earliest position at which the whole regexp can match is
the C<'a'> just before the first C<m>, so the first quantifier C<.??>
matches that C<a> rather than the empty string.  The second quantifier
is greedy, so it matches C<mm>, and the third matches the rest of the
string.
1058*0Sstevel@tonic-gate
1059*0Sstevel@tonic-gateWe can modify principle 3 above to take into account non-greedy
1060*0Sstevel@tonic-gatequantifiers:
1061*0Sstevel@tonic-gate
1062*0Sstevel@tonic-gate=over 4
1063*0Sstevel@tonic-gate
1064*0Sstevel@tonic-gate=item *
1065*0Sstevel@tonic-gate
1066*0Sstevel@tonic-gatePrinciple 3: If there are two or more elements in a regexp, the
1067*0Sstevel@tonic-gateleftmost greedy (non-greedy) quantifier, if any, will match as much
1068*0Sstevel@tonic-gate(little) of the string as possible while still allowing the whole
1069*0Sstevel@tonic-gateregexp to match.  The next leftmost greedy (non-greedy) quantifier, if
1070*0Sstevel@tonic-gateany, will try to match as much (little) of the string remaining
1071*0Sstevel@tonic-gateavailable to it as possible, while still allowing the whole regexp to
1072*0Sstevel@tonic-gatematch.  And so on, until all the regexp elements are satisfied.
1073*0Sstevel@tonic-gate
1074*0Sstevel@tonic-gate=back
1075*0Sstevel@tonic-gate
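As a minimal sketch of Principle 3 (the test string and variable names
are made up for illustration), compare how a greedy and a non-greedy
leftmost quantifier divide the same string:

```perl
use strict;
use warnings;

my $str = "aaaa";

# Greedy leftmost quantifier: a+ takes everything it can,
# leaving nothing for the a* that follows.
my @greedy = $str =~ /^(a+)(a*)$/;

# Non-greedy leftmost quantifier: a+? takes the minimum (one 'a'),
# so the greedy a* that follows gets the rest.
my @lazy = $str =~ /^(a+?)(a*)$/;

print "greedy: '$greedy[0]' '$greedy[1]'\n";   # greedy: 'aaaa' ''
print "lazy:   '$lazy[0]' '$lazy[1]'\n";       # lazy:   'a' 'aaa'
```

In both cases the whole regexp still matches all of C<aaaa>; only the
way the string is shared out between the groupings changes.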
1076*0Sstevel@tonic-gateJust like alternation, quantifiers are also susceptible to
1077*0Sstevel@tonic-gatebacktracking.  Here is a step-by-step analysis of the example
1078*0Sstevel@tonic-gate
1079*0Sstevel@tonic-gate    $x = "the cat in the hat";
1080*0Sstevel@tonic-gate    $x =~ /^(.*)(at)(.*)$/; # matches,
1081*0Sstevel@tonic-gate                            # $1 = 'the cat in the h'
1082*0Sstevel@tonic-gate                            # $2 = 'at'
1083*0Sstevel@tonic-gate                            # $3 = ''   (0 matches)
1084*0Sstevel@tonic-gate
1085*0Sstevel@tonic-gate=over 4
1086*0Sstevel@tonic-gate
1087*0Sstevel@tonic-gate=item 0
1088*0Sstevel@tonic-gate
1089*0Sstevel@tonic-gateStart with the first letter in the string 't'.
1090*0Sstevel@tonic-gate
1091*0Sstevel@tonic-gate=item 1
1092*0Sstevel@tonic-gate
1093*0Sstevel@tonic-gateThe first quantifier '.*' starts out by matching the whole
1094*0Sstevel@tonic-gatestring 'the cat in the hat'.
1095*0Sstevel@tonic-gate
1096*0Sstevel@tonic-gate=item 2
1097*0Sstevel@tonic-gate
1098*0Sstevel@tonic-gate'a' in the regexp element 'at' doesn't match the end of the
1099*0Sstevel@tonic-gatestring.  Backtrack one character.
1100*0Sstevel@tonic-gate
1101*0Sstevel@tonic-gate=item 3
1102*0Sstevel@tonic-gate
1103*0Sstevel@tonic-gate'a' in the regexp element 'at' still doesn't match the last
1104*0Sstevel@tonic-gateletter of the string 't', so backtrack one more character.
1105*0Sstevel@tonic-gate
1106*0Sstevel@tonic-gate=item 4
1107*0Sstevel@tonic-gate
1108*0Sstevel@tonic-gateNow we can match the 'a' and the 't'.
1109*0Sstevel@tonic-gate
1110*0Sstevel@tonic-gate=item 5
1111*0Sstevel@tonic-gate
1112*0Sstevel@tonic-gateMove on to the third element '.*'.  Since we are at the end of
1113*0Sstevel@tonic-gatethe string and '.*' can match 0 times, assign it the empty string.
1114*0Sstevel@tonic-gate
1115*0Sstevel@tonic-gate=item 6
1116*0Sstevel@tonic-gate
1117*0Sstevel@tonic-gateWe are done!
1118*0Sstevel@tonic-gate
1119*0Sstevel@tonic-gate=back
1120*0Sstevel@tonic-gate
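For comparison, here is a small sketch of the same match with the first
quantifier made non-greedy.  Instead of grabbing the whole string and
backing off, C<.*?> starts with nothing and extends one character at a
time until C<at> can match.  The variable names are illustrative only.

```perl
use strict;
use warnings;

my $x = "the cat in the hat";

# Greedy first quantifier: backtracks from the end of the string,
# so it stops at the LAST 'at'.
my @greedy = $x =~ /^(.*)(at)(.*)$/;

# Non-greedy first quantifier: grows from the start of the string,
# so it stops at the FIRST 'at'.
my @lazy = $x =~ /^(.*?)(at)(.*)$/;

print "greedy: '$greedy[0]' '$greedy[1]' '$greedy[2]'\n";
# greedy: 'the cat in the h' 'at' ''
print "lazy:   '$lazy[0]' '$lazy[1]' '$lazy[2]'\n";
# lazy:   'the c' 'at' ' in the hat'
```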
Most of the time, all this moving forward and backtracking happens
quickly and searching is fast.  There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string.  A typical structure that blows up in your face is of the form
1125*0Sstevel@tonic-gate
1126*0Sstevel@tonic-gate    /(a|b+)*/;
1127*0Sstevel@tonic-gate
The problem is the nested indeterminate quantifiers.  There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> of length k and the second of length n-k, m repetitions
whose bits add up to length n, etc.  In fact, the number of ways to
partition a string grows exponentially with its length.  A
regexp may get lucky and match early in the process, but if there is
no match, perl will try I<every> possibility before giving up.  So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s.  The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.
1139*0Sstevel@tonic-gate
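One way to defuse a pattern like this, offered as a sketch rather than
a general recipe, is to rewrite it so that only one quantifier remains.
For this particular pattern, a character class accepts exactly the same
strings.  The sample strings below are invented and kept short so that
the nested form stays fast:

```perl
use strict;
use warnings;

my @samples = ("", "a", "abba", "bbbab", "abc", "c");
my $mismatches = 0;

for my $s (@samples) {
    # Nested quantifiers: many ways to split a run of b's
    my $nested = ($s =~ /^(a|b+)*$/) ? 1 : 0;
    # Single quantifier over a character class: only one way
    my $flat   = ($s =~ /^[ab]*$/)   ? 1 : 0;

    $mismatches++ if $nested != $flat;
    print "'$s': nested=$nested flat=$flat\n";
}

print "mismatches: $mismatches\n";   # mismatches: 0
```

The two forms agree on every sample, but the second gives the regexp
engine no alternative partitions to try when a match fails.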
1140*0Sstevel@tonic-gate=head2 Building a regexp
1141*0Sstevel@tonic-gate
1142*0Sstevel@tonic-gateAt this point, we have all the basic regexp concepts covered, so let's
1143*0Sstevel@tonic-gategive a more involved example of a regular expression.  We will build a
1144*0Sstevel@tonic-gateregexp that matches numbers.
1145*0Sstevel@tonic-gate
1146*0Sstevel@tonic-gateThe first task in building a regexp is to decide what we want to match
1147*0Sstevel@tonic-gateand what we want to exclude.  In our case, we want to match both
1148*0Sstevel@tonic-gateintegers and floating point numbers and we want to reject any string
1149*0Sstevel@tonic-gatethat isn't a number.
1150*0Sstevel@tonic-gate
1151*0Sstevel@tonic-gateThe next task is to break the problem down into smaller problems that
1152*0Sstevel@tonic-gateare easily converted into a regexp.
1153*0Sstevel@tonic-gate
1154*0Sstevel@tonic-gateThe simplest case is integers.  These consist of a sequence of digits,
1155*0Sstevel@tonic-gatewith an optional sign in front.  The digits we can represent with
1156*0Sstevel@tonic-gateC<\d+> and the sign can be matched with C<[+-]>.  Thus the integer
1157*0Sstevel@tonic-gateregexp is
1158*0Sstevel@tonic-gate
1159*0Sstevel@tonic-gate    /[+-]?\d+/;  # matches integers
1160*0Sstevel@tonic-gate
1161*0Sstevel@tonic-gateA floating point number potentially has a sign, an integral part, a
1162*0Sstevel@tonic-gatedecimal point, a fractional part, and an exponent.  One or more of these
1163*0Sstevel@tonic-gateparts is optional, so we need to check out the different
1164*0Sstevel@tonic-gatepossibilities.  Floating point numbers which are in proper form include
1165*0Sstevel@tonic-gate123., 0.345, .34, -1e6, and 25.4E-72.  As with integers, the sign out
1166*0Sstevel@tonic-gatefront is completely optional and can be matched by C<[+-]?>.  We can
1167*0Sstevel@tonic-gatesee that if there is no exponent, floating point numbers must have a
1168*0Sstevel@tonic-gatedecimal point, otherwise they are integers.  We might be tempted to
1169*0Sstevel@tonic-gatemodel these with C<\d*\.\d*>, but this would also match just a single
1170*0Sstevel@tonic-gatedecimal point, which is not a number.  So the three cases of floating
1171*0Sstevel@tonic-gatepoint number sans exponent are
1172*0Sstevel@tonic-gate
1173*0Sstevel@tonic-gate   /[+-]?\d+\./;  # 1., 321., etc.
1174*0Sstevel@tonic-gate   /[+-]?\.\d+/;  # .1, .234, etc.
1175*0Sstevel@tonic-gate   /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.
1176*0Sstevel@tonic-gate
1177*0Sstevel@tonic-gateThese can be combined into a single regexp with a three-way alternation:
1178*0Sstevel@tonic-gate
1179*0Sstevel@tonic-gate   /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent
1180*0Sstevel@tonic-gate
1181*0Sstevel@tonic-gateIn this alternation, it is important to put C<'\d+\.\d+'> before
1182*0Sstevel@tonic-gateC<'\d+\.'>.  If C<'\d+\.'> were first, the regexp would happily match that
1183*0Sstevel@tonic-gateand ignore the fractional part of the number.
1184*0Sstevel@tonic-gate
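The effect of the ordering can be checked directly.  In this short
sketch (the test value C<3.14> is just an example), the only difference
between the two regexps is the order of the first two alternatives:

```perl
use strict;
use warnings;

my $num = "3.14";

# a.b form listed first: the whole number is matched
my ($good) = $num =~ /(\d+\.\d+|\d+\.|\.\d+)/;

# a. form listed first: it succeeds immediately and the
# fractional digits are left behind
my ($bad)  = $num =~ /(\d+\.|\d+\.\d+|\.\d+)/;

print "a.b first: '$good'\n";   # a.b first: '3.14'
print "a.  first: '$bad'\n";    # a.  first: '3.'
```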
1185*0Sstevel@tonic-gateNow consider floating point numbers with exponents.  The key
1186*0Sstevel@tonic-gateobservation here is that I<both> integers and numbers with decimal
1187*0Sstevel@tonic-gatepoints are allowed in front of an exponent.  Then exponents, like the
1188*0Sstevel@tonic-gateoverall sign, are independent of whether we are matching numbers with
1189*0Sstevel@tonic-gateor without decimal points, and can be 'decoupled' from the
1190*0Sstevel@tonic-gatemantissa.  The overall form of the regexp now becomes clear:
1191*0Sstevel@tonic-gate
1192*0Sstevel@tonic-gate    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
1193*0Sstevel@tonic-gate
1194*0Sstevel@tonic-gateThe exponent is an C<e> or C<E>, followed by an integer.  So the
1195*0Sstevel@tonic-gateexponent regexp is
1196*0Sstevel@tonic-gate
1197*0Sstevel@tonic-gate   /[eE][+-]?\d+/;  # exponent
1198*0Sstevel@tonic-gate
1199*0Sstevel@tonic-gatePutting all the parts together, we get a regexp that matches numbers:
1200*0Sstevel@tonic-gate
1201*0Sstevel@tonic-gate   /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!
1202*0Sstevel@tonic-gate
Long regexps like this may impress your friends, but can be hard to
decipher.  In complex situations like this, the C<//x> modifier for a
match is invaluable.  It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning.  Using it,
we can rewrite our 'extended' regexp in the more pleasing form
1208*0Sstevel@tonic-gate
1209*0Sstevel@tonic-gate   /^
1210*0Sstevel@tonic-gate      [+-]?         # first, match an optional sign
1211*0Sstevel@tonic-gate      (             # then match integers or f.p. mantissas:
1212*0Sstevel@tonic-gate          \d+\.\d+  # mantissa of the form a.b
1213*0Sstevel@tonic-gate         |\d+\.     # mantissa of the form a.
1214*0Sstevel@tonic-gate         |\.\d+     # mantissa of the form .b
1215*0Sstevel@tonic-gate         |\d+       # integer of the form a
1216*0Sstevel@tonic-gate      )
1217*0Sstevel@tonic-gate      ([eE][+-]?\d+)?  # finally, optionally match an exponent
1218*0Sstevel@tonic-gate   $/x;
1219*0Sstevel@tonic-gate
If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp? The answer is to backslash them
S<C<'\ '> > or put them in a character class S<C<[ ]> >.  The same
goes for pound signs: use C<\#> or C<[#]>.  For instance, Perl allows
a space between the sign and the mantissa/integer, and we could add
this to our regexp as follows:
1226*0Sstevel@tonic-gate
1227*0Sstevel@tonic-gate   /^
1228*0Sstevel@tonic-gate      [+-]?\ *      # first, match an optional sign *and space*
1229*0Sstevel@tonic-gate      (             # then match integers or f.p. mantissas:
1230*0Sstevel@tonic-gate          \d+\.\d+  # mantissa of the form a.b
1231*0Sstevel@tonic-gate         |\d+\.     # mantissa of the form a.
1232*0Sstevel@tonic-gate         |\.\d+     # mantissa of the form .b
1233*0Sstevel@tonic-gate         |\d+       # integer of the form a
1234*0Sstevel@tonic-gate      )
1235*0Sstevel@tonic-gate      ([eE][+-]?\d+)?  # finally, optionally match an exponent
1236*0Sstevel@tonic-gate   $/x;
1237*0Sstevel@tonic-gate
1238*0Sstevel@tonic-gateIn this form, it is easier to see a way to simplify the
1239*0Sstevel@tonic-gatealternation.  Alternatives 1, 2, and 4 all start with C<\d+>, so it
1240*0Sstevel@tonic-gatecould be factored out:
1241*0Sstevel@tonic-gate
1242*0Sstevel@tonic-gate   /^
1243*0Sstevel@tonic-gate      [+-]?\ *      # first, match an optional sign
1244*0Sstevel@tonic-gate      (             # then match integers or f.p. mantissas:
1245*0Sstevel@tonic-gate          \d+       # start out with a ...
1246*0Sstevel@tonic-gate          (
1247*0Sstevel@tonic-gate              \.\d* # mantissa of the form a.b or a.
1248*0Sstevel@tonic-gate          )?        # ? takes care of integers of the form a
1249*0Sstevel@tonic-gate         |\.\d+     # mantissa of the form .b
1250*0Sstevel@tonic-gate      )
1251*0Sstevel@tonic-gate      ([eE][+-]?\d+)?  # finally, optionally match an exponent
1252*0Sstevel@tonic-gate   $/x;
1253*0Sstevel@tonic-gate
1254*0Sstevel@tonic-gateor written in the compact form,
1255*0Sstevel@tonic-gate
1256*0Sstevel@tonic-gate    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
1257*0Sstevel@tonic-gate
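Before trusting a regexp like this, it is worth running it against
strings it should accept and strings it should reject.  The following
is a minimal sketch of such a check; the helper name C<is_number> and
the lists of test strings are our own and not part of any library:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the compact regexp from the text
sub is_number {
    my ($str) = @_;
    return $str =~ /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/ ? 1 : 0;
}

my @good = ("123.", "0.345", ".34", "-1e6", "25.4E-72", "42", "+ 7");
my @bad  = (".", "abc", "1e", "1.2.3", "e6", "--1");

my $accepted = grep {  is_number($_) } @good;   # how many good ones pass
my $rejected = grep { !is_number($_) } @bad;    # how many bad ones fail

print "accepted $accepted of ", scalar(@good), "\n";   # accepted 7 of 7
print "rejected $rejected of ", scalar(@bad),  "\n";   # rejected 6 of 6
```

Checking the rejects matters as much as checking the accepts; a lone
decimal point, for instance, is exactly the case that C<\d*\.\d*> would
have let through.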
1258*0Sstevel@tonic-gateThis is our final regexp.  To recap, we built a regexp by
1259*0Sstevel@tonic-gate
1260*0Sstevel@tonic-gate=over 4
1261*0Sstevel@tonic-gate
1262*0Sstevel@tonic-gate=item *
1263*0Sstevel@tonic-gate
1264*0Sstevel@tonic-gatespecifying the task in detail,
1265*0Sstevel@tonic-gate
1266*0Sstevel@tonic-gate=item *
1267*0Sstevel@tonic-gate
1268*0Sstevel@tonic-gatebreaking down the problem into smaller parts,
1269*0Sstevel@tonic-gate
1270*0Sstevel@tonic-gate=item *
1271*0Sstevel@tonic-gate
1272*0Sstevel@tonic-gatetranslating the small parts into regexps,
1273*0Sstevel@tonic-gate
1274*0Sstevel@tonic-gate=item *
1275*0Sstevel@tonic-gate
1276*0Sstevel@tonic-gatecombining the regexps,
1277*0Sstevel@tonic-gate
1278*0Sstevel@tonic-gate=item *
1279*0Sstevel@tonic-gate
1280*0Sstevel@tonic-gateand optimizing the final combined regexp.
1281*0Sstevel@tonic-gate
1282*0Sstevel@tonic-gate=back
1283*0Sstevel@tonic-gate
These are also the typical steps involved in writing a computer
program.  This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that
specifies patterns.
1288*0Sstevel@tonic-gate
1289*0Sstevel@tonic-gate=head2 Using regular expressions in Perl
1290*0Sstevel@tonic-gate
1291*0Sstevel@tonic-gateThe last topic of Part 1 briefly covers how regexps are used in Perl
1292*0Sstevel@tonic-gateprograms.  Where do they fit into Perl syntax?
1293*0Sstevel@tonic-gate
1294*0Sstevel@tonic-gateWe have already introduced the matching operator in its default
1295*0Sstevel@tonic-gateC</regexp/> and arbitrary delimiter C<m!regexp!> forms.  We have used
1296*0Sstevel@tonic-gatethe binding operator C<=~> and its negation C<!~> to test for string
1297*0Sstevel@tonic-gatematches.  Associated with the matching operator, we have discussed the
1298*0Sstevel@tonic-gatesingle line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
1299*0Sstevel@tonic-gateextended C<//x> modifiers.
1300*0Sstevel@tonic-gate
1301*0Sstevel@tonic-gateThere are a few more things you might want to know about matching
1302*0Sstevel@tonic-gateoperators.  First, we pointed out earlier that variables in regexps are
1303*0Sstevel@tonic-gatesubstituted before the regexp is evaluated:
1304*0Sstevel@tonic-gate
1305*0Sstevel@tonic-gate    $pattern = 'Seuss';
1306*0Sstevel@tonic-gate    while (<>) {
1307*0Sstevel@tonic-gate        print if /$pattern/;
1308*0Sstevel@tonic-gate    }
1309*0Sstevel@tonic-gate
1310*0Sstevel@tonic-gateThis will print any lines containing the word C<Seuss>.  It is not as
1311*0Sstevel@tonic-gateefficient as it could be, however, because perl has to re-evaluate
1312*0Sstevel@tonic-gateC<$pattern> each time through the loop.  If C<$pattern> won't be
1313*0Sstevel@tonic-gatechanging over the lifetime of the script, we can add the C<//o>
1314*0Sstevel@tonic-gatemodifier, which directs perl to only perform variable substitutions
1315*0Sstevel@tonic-gateonce:
1316*0Sstevel@tonic-gate
1317*0Sstevel@tonic-gate    #!/usr/bin/perl
1318*0Sstevel@tonic-gate    #    Improved simple_grep
1319*0Sstevel@tonic-gate    $regexp = shift;
1320*0Sstevel@tonic-gate    while (<>) {
1321*0Sstevel@tonic-gate        print if /$regexp/o;  # a good deal faster
1322*0Sstevel@tonic-gate    }
1323*0Sstevel@tonic-gate
If you change C<$regexp> after the first substitution happens, perl
will ignore the change.  If you don't want any substitutions at all,
use the special delimiter C<m''>:
1327*0Sstevel@tonic-gate
1328*0Sstevel@tonic-gate    @pattern = ('Seuss');
1329*0Sstevel@tonic-gate    while (<>) {
1330*0Sstevel@tonic-gate        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
1331*0Sstevel@tonic-gate    }
1332*0Sstevel@tonic-gate
1333*0Sstevel@tonic-gateC<m''> acts like single quotes on a regexp; all other C<m> delimiters
1334*0Sstevel@tonic-gateact like double quotes.  If the regexp evaluates to the empty string,
1335*0Sstevel@tonic-gatethe regexp in the I<last successful match> is used instead.  So we have
1336*0Sstevel@tonic-gate
    "dog" =~ /d/;  # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before
1339*0Sstevel@tonic-gate
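Both behaviours are easy to confirm with a short script.  This is only
a sketch, and the strings and variable names are invented for the
purpose:

```perl
use strict;
use warnings;

# An empty pattern reuses the regexp of the last successful match,
# which here is /d/.
my $first      = ("dog"     =~ /d/) ? 1 : 0;
my $reuse_hit  = ("dogbert" =~ //)  ? 1 : 0;   # 'dogbert' contains 'd'
my $reuse_miss = ("cat"     =~ //)  ? 1 : 0;   # 'cat' does not

# m'' suppresses interpolation, so '@pattern' is matched literally;
# with ordinary delimiters the array is interpolated first.
my @pattern = ('Seuss');
my $literal = ('user@pattern' =~ m'@pattern') ? 1 : 0;
my $interp  = ('Dr. Seuss'    =~ m/@pattern/) ? 1 : 0;

print "empty pattern: $reuse_hit $reuse_miss\n";   # empty pattern: 1 0
print "quoting:       $literal $interp\n";         # quoting:       1 1
```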
The final two modifiers C<//g> and C<//c> concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along.  You can get or set the position with the
C<pos()> function.
1347*0Sstevel@tonic-gate
1348*0Sstevel@tonic-gateThe use of C<//g> is shown in the following example.  Suppose we have
1349*0Sstevel@tonic-gatea string that consists of words separated by spaces.  If we know how
1350*0Sstevel@tonic-gatemany words there are in advance, we could extract the words using
1351*0Sstevel@tonic-gategroupings:
1352*0Sstevel@tonic-gate
1353*0Sstevel@tonic-gate    $x = "cat dog house"; # 3 words
1354*0Sstevel@tonic-gate    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
1355*0Sstevel@tonic-gate                                           # $1 = 'cat'
1356*0Sstevel@tonic-gate                                           # $2 = 'dog'
1357*0Sstevel@tonic-gate                                           # $3 = 'house'
1358*0Sstevel@tonic-gate
1359*0Sstevel@tonic-gateBut what if we had an indeterminate number of words? This is the sort
1360*0Sstevel@tonic-gateof task C<//g> was made for.  To extract all words, form the simple
1361*0Sstevel@tonic-gateregexp C<(\w+)> and loop over all matches with C</(\w+)/g>:
1362*0Sstevel@tonic-gate
1363*0Sstevel@tonic-gate    while ($x =~ /(\w+)/g) {
1364*0Sstevel@tonic-gate        print "Word is $1, ends at position ", pos $x, "\n";
1365*0Sstevel@tonic-gate    }
1366*0Sstevel@tonic-gate
1367*0Sstevel@tonic-gateprints
1368*0Sstevel@tonic-gate
1369*0Sstevel@tonic-gate    Word is cat, ends at position 3
1370*0Sstevel@tonic-gate    Word is dog, ends at position 7
1371*0Sstevel@tonic-gate    Word is house, ends at position 13
1372*0Sstevel@tonic-gate
A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>.  The current position in the
string is associated with the string, not the regexp.  This means that
different strings have different positions and their respective
positions can be set or read independently.
1379*0Sstevel@tonic-gate
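The following sketch traces C<pos()> through a successful match, a
failed match, and a failed match with C<//c>.  The regexp C</zebra/> is
simply a pattern chosen because it cannot match this string:

```perl
use strict;
use warnings;

my $x = "cat dog house";

my $ok = $x =~ /\w+/g;           # scalar context: matches 'cat'
my $pos_after_match = pos $x;    # 3

$ok = $x =~ /zebra/g;            # fails, so the position is reset
my $pos_after_fail = pos $x;     # undef

pos($x) = 0;                     # set the position by hand
$ok = $x =~ /\w+/g;              # matches 'cat' again
$ok = $x =~ /zebra/gc;           # fails, but //c keeps the position
my $pos_after_gc = pos $x;       # still 3

print "after match:       $pos_after_match\n";
print "after failed //g:  ",
      defined $pos_after_fail ? $pos_after_fail : 'undef', "\n";
print "after failed //gc: $pos_after_gc\n";
```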
1380*0Sstevel@tonic-gateIn list context, C<//g> returns a list of matched groupings, or if
1381*0Sstevel@tonic-gatethere are no groupings, a list of matches to the whole regexp.  So if
1382*0Sstevel@tonic-gatewe wanted just the words, we could use
1383*0Sstevel@tonic-gate
    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'
1388*0Sstevel@tonic-gate
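When the regexp has more than one grouping, the list returned in list
context contains the groupings from every match, one after another.  A
handy consequence, sketched here with an invented C<key=value> string,
is that two groupings can fill a hash in one statement:

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";

# Each match contributes two elements: a key and a value
my %setting = $config =~ /(\w+)=(\d+)/g;

for my $key (sort keys %setting) {
    print "$key is $setting{$key}\n";
}
# depth is 8
# height is 24
# width is 80
```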
1389*0Sstevel@tonic-gateClosely associated with the C<//g> modifier is the C<\G> anchor.  The
1390*0Sstevel@tonic-gateC<\G> anchor matches at the point where the previous C<//g> match left
1391*0Sstevel@tonic-gateoff.  C<\G> allows us to easily do context-sensitive matching:
1392*0Sstevel@tonic-gate
1393*0Sstevel@tonic-gate    $metric = 1;  # use metric units
1394*0Sstevel@tonic-gate    ...
1395*0Sstevel@tonic-gate    $x = <FILE>;  # read in measurement
1396*0Sstevel@tonic-gate    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
1397*0Sstevel@tonic-gate    $weight = $1;
1398*0Sstevel@tonic-gate    if ($metric) { # error checking
1399*0Sstevel@tonic-gate        print "Units error!" unless $x =~ /\Gkg\./g;
1400*0Sstevel@tonic-gate    }
1401*0Sstevel@tonic-gate    else {
1402*0Sstevel@tonic-gate        print "Units error!" unless $x =~ /\Glbs\./g;
1403*0Sstevel@tonic-gate    }
1404*0Sstevel@tonic-gate    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing
1405*0Sstevel@tonic-gate
1406*0Sstevel@tonic-gateThe combination of C<//g> and C<\G> allows us to process the string a
1407*0Sstevel@tonic-gatebit at a time and use arbitrary Perl logic to decide what to do next.
1408*0Sstevel@tonic-gateCurrently, the C<\G> anchor is only fully supported when used to anchor
1409*0Sstevel@tonic-gateto the start of the pattern.
1410*0Sstevel@tonic-gate
1411*0Sstevel@tonic-gateC<\G> is also invaluable in processing fixed length records with
1412*0Sstevel@tonic-gateregexps.  Suppose we have a snippet of coding region DNA, encoded as
1413*0Sstevel@tonic-gatebase pair letters C<ATCGTTGAAT...> and we want to find all the stop
1414*0Sstevel@tonic-gatecodons C<TGA>.  In a coding region, codons are 3-letter sequences, so
1415*0Sstevel@tonic-gatewe can think of the DNA snippet as a sequence of 3-letter records.  The
1416*0Sstevel@tonic-gatenaive regexp
1417*0Sstevel@tonic-gate
1418*0Sstevel@tonic-gate    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
1419*0Sstevel@tonic-gate    $dna = "ATCGTTGAATGCAAATGACATGAC";
1420*0Sstevel@tonic-gate    $dna =~ /TGA/;
1421*0Sstevel@tonic-gate
1422*0Sstevel@tonic-gatedoesn't work; it may match a C<TGA>, but there is no guarantee that
1423*0Sstevel@tonic-gatethe match is aligned with codon boundaries, e.g., the substring
1424*0Sstevel@tonic-gateS<C<GTT GAA> > gives a match.  A better solution is
1425*0Sstevel@tonic-gate
1426*0Sstevel@tonic-gate    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
1427*0Sstevel@tonic-gate        print "Got a TGA stop codon at position ", pos $dna, "\n";
1428*0Sstevel@tonic-gate    }
1429*0Sstevel@tonic-gate
1430*0Sstevel@tonic-gatewhich prints
1431*0Sstevel@tonic-gate
1432*0Sstevel@tonic-gate    Got a TGA stop codon at position 18
1433*0Sstevel@tonic-gate    Got a TGA stop codon at position 23
1434*0Sstevel@tonic-gate
1435*0Sstevel@tonic-gatePosition 18 is good, but position 23 is bogus.  What happened?
1436*0Sstevel@tonic-gate
1437*0Sstevel@tonic-gateThe answer is that our regexp works well until we get past the last
1438*0Sstevel@tonic-gatereal match.  Then the regexp will fail to match a synchronized C<TGA>
1439*0Sstevel@tonic-gateand start stepping ahead one character position at a time, not what we
1440*0Sstevel@tonic-gatewant.  The solution is to use C<\G> to anchor the match to the codon
1441*0Sstevel@tonic-gatealignment:
1442*0Sstevel@tonic-gate
1443*0Sstevel@tonic-gate    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
1444*0Sstevel@tonic-gate        print "Got a TGA stop codon at position ", pos $dna, "\n";
1445*0Sstevel@tonic-gate    }
1446*0Sstevel@tonic-gate
1447*0Sstevel@tonic-gateThis prints
1448*0Sstevel@tonic-gate
1449*0Sstevel@tonic-gate    Got a TGA stop codon at position 18
1450*0Sstevel@tonic-gate
1451*0Sstevel@tonic-gatewhich is the correct answer.  This example illustrates that it is
1452*0Sstevel@tonic-gateimportant not only to match what is desired, but to reject what is not
1453*0Sstevel@tonic-gatedesired.
1454*0Sstevel@tonic-gate
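One way to gain confidence in a result like this is to compute it a
second way.  The sketch below is a cross-check, not a replacement for
the C<\G> technique: it cuts the same DNA string into 3-letter codons
with a list-context C<//g> match and then looks for C<TGA> among them.
The variable names are our own.

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";

# Successive //g matches never overlap, so this yields aligned codons
my @codons = $dna =~ /(\w\w\w)/g;

my @stop_ends;
for my $i (0 .. $#codons) {
    # Codon number $i ends at string position 3*($i+1)
    push @stop_ends, 3 * ($i + 1) if $codons[$i] eq 'TGA';
}

print "Got a TGA stop codon at position $_\n" for @stop_ends;
# Got a TGA stop codon at position 18
```

Only the aligned stop codon is reported, in agreement with the
C<\G>-anchored loop.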
1455*0Sstevel@tonic-gateB<search and replace>
1456*0Sstevel@tonic-gate
1457*0Sstevel@tonic-gateRegular expressions also play a big role in B<search and replace>
1458*0Sstevel@tonic-gateoperations in Perl.  Search and replace is accomplished with the
1459*0Sstevel@tonic-gateC<s///> operator.  The general form is
1460*0Sstevel@tonic-gateC<s/regexp/replacement/modifiers>, with everything we know about
1461*0Sstevel@tonic-gateregexps and modifiers applying in this case as well.  The
1462*0Sstevel@tonic-gateC<replacement> is a Perl double quoted string that replaces in the
1463*0Sstevel@tonic-gatestring whatever is matched with the C<regexp>.  The operator C<=~> is
1464*0Sstevel@tonic-gatealso used here to associate a string with C<s///>.  If matching
1465*0Sstevel@tonic-gateagainst C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
1466*0Sstevel@tonic-gateC<s///> returns the number of substitutions made, otherwise it returns
1467*0Sstevel@tonic-gatefalse.  Here are a few examples:
1468*0Sstevel@tonic-gate
1469*0Sstevel@tonic-gate    $x = "Time to feed the cat!";
1470*0Sstevel@tonic-gate    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
1471*0Sstevel@tonic-gate    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
1472*0Sstevel@tonic-gate        $more_insistent = 1;
1473*0Sstevel@tonic-gate    }
1474*0Sstevel@tonic-gate    $y = "'quoted words'";
1475*0Sstevel@tonic-gate    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
1476*0Sstevel@tonic-gate                           # $y contains "quoted words"
1477*0Sstevel@tonic-gate
1478*0Sstevel@tonic-gateIn the last example, the whole string was matched, but only the part
1479*0Sstevel@tonic-gateinside the single quotes was grouped.  With the C<s///> operator, the
1480*0Sstevel@tonic-gatematched variables C<$1>, C<$2>, etc.  are immediately available for use
1481*0Sstevel@tonic-gatein the replacement expression, so we use C<$1> to replace the quoted
1482*0Sstevel@tonic-gatestring with just what was quoted.  With the global modifier, C<s///g>
1483*0Sstevel@tonic-gatewill search and replace all occurrences of the regexp in the string:
1484*0Sstevel@tonic-gate
1485*0Sstevel@tonic-gate    $x = "I batted 4 for 4";
1486*0Sstevel@tonic-gate    $x =~ s/4/four/;   # doesn't do it all:
1487*0Sstevel@tonic-gate                       # $x contains "I batted four for 4"
1488*0Sstevel@tonic-gate    $x = "I batted 4 for 4";
1489*0Sstevel@tonic-gate    $x =~ s/4/four/g;  # does it all:
1490*0Sstevel@tonic-gate                       # $x contains "I batted four for four"
1491*0Sstevel@tonic-gate
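Because C<s///> returns the number of substitutions made, the result
can be captured and used.  A brief sketch, with variable names chosen
for illustration:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $count = ($x =~ s/4/four/g);   # two substitutions made
my $none  = ($x =~ s/9/nine/g);   # nothing to replace: returns false

print "made $count substitutions: $x\n";
# made 2 substitutions: I batted four for four
print "no match\n" unless $none;
# no match
```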
1492*0Sstevel@tonic-gateIf you prefer 'regex' over 'regexp' in this tutorial, you could use
1493*0Sstevel@tonic-gatethe following program to replace it:
1494*0Sstevel@tonic-gate
1495*0Sstevel@tonic-gate    % cat > simple_replace
1496*0Sstevel@tonic-gate    #!/usr/bin/perl
1497*0Sstevel@tonic-gate    $regexp = shift;
1498*0Sstevel@tonic-gate    $replacement = shift;
1499*0Sstevel@tonic-gate    while (<>) {
1500*0Sstevel@tonic-gate        s/$regexp/$replacement/go;
1501*0Sstevel@tonic-gate        print;
1502*0Sstevel@tonic-gate    }
1503*0Sstevel@tonic-gate    ^D
1504*0Sstevel@tonic-gate
1505*0Sstevel@tonic-gate    % simple_replace regexp regex perlretut.pod
1506*0Sstevel@tonic-gate
1507*0Sstevel@tonic-gateIn C<simple_replace> we used the C<s///g> modifier to replace all
1508*0Sstevel@tonic-gateoccurrences of the regexp on each line and the C<s///o> modifier to
1509*0Sstevel@tonic-gatecompile the regexp only once.  As with C<simple_grep>, both the
1510*0Sstevel@tonic-gateC<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.
1511*0Sstevel@tonic-gate
1512*0Sstevel@tonic-gateA modifier available specifically to search and replace is the
1513*0Sstevel@tonic-gateC<s///e> evaluation modifier.  C<s///e> wraps an C<eval{...}> around
1514*0Sstevel@tonic-gatethe replacement string and the evaluated result is substituted for the
1515*0Sstevel@tonic-gatematched substring.  C<s///e> is useful if you need to do a bit of
1516*0Sstevel@tonic-gatecomputation in the process of replacing text.  This example counts
1517*0Sstevel@tonic-gatecharacter frequencies in a line:
1518*0Sstevel@tonic-gate
1519*0Sstevel@tonic-gate    $x = "Bill the cat";
1520*0Sstevel@tonic-gate    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
1521*0Sstevel@tonic-gate    print "frequency of '$_' is $chars{$_}\n"
1522*0Sstevel@tonic-gate        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
1523*0Sstevel@tonic-gate
This prints the following; the order of characters with equal counts
depends on hash ordering and may differ from run to run:
1525*0Sstevel@tonic-gate
1526*0Sstevel@tonic-gate    frequency of ' ' is 2
1527*0Sstevel@tonic-gate    frequency of 't' is 2
1528*0Sstevel@tonic-gate    frequency of 'l' is 2
1529*0Sstevel@tonic-gate    frequency of 'B' is 1
1530*0Sstevel@tonic-gate    frequency of 'c' is 1
1531*0Sstevel@tonic-gate    frequency of 'e' is 1
1532*0Sstevel@tonic-gate    frequency of 'h' is 1
1533*0Sstevel@tonic-gate    frequency of 'i' is 1
1534*0Sstevel@tonic-gate    frequency of 'a' is 1
1535*0Sstevel@tonic-gate
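A simpler use of C<s///e> is to do arithmetic on the matched text.  In
this small sketch (the sentence is made up), every integer in the
string is replaced by twice its value:

```perl
use strict;
use warnings;

my $x = "3 apples and 5 pears";

# The replacement is evaluated as Perl code for each match
$x =~ s/(\d+)/$1 * 2/eg;

print "$x\n";   # 6 apples and 10 pears
```

Without the C<e> modifier, the replacement would be the literal text
C<$1 * 2> with only C<$1> interpolated.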
1536*0Sstevel@tonic-gateAs with the match C<m//> operator, C<s///> can use other delimiters,
1537*0Sstevel@tonic-gatesuch as C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are
1538*0Sstevel@tonic-gateused C<s'''>, then the regexp and replacement are treated as single
1539*0Sstevel@tonic-gatequoted strings and there are no substitutions.  C<s///> in list context
1540*0Sstevel@tonic-gatereturns the same thing as in scalar context, i.e., the number of
1541*0Sstevel@tonic-gatematches.
1542*0Sstevel@tonic-gate
1543*0Sstevel@tonic-gateB<The split operator>
1544*0Sstevel@tonic-gate
The B<C<split> > function can also optionally use a matching operator
C<m//> to split a string.  C<split /regexp/, string, limit> splits
C<string> into a list of substrings and returns that list.  The regexp
matches the separator, i.e., the character sequence at which
C<string> is split.  The C<limit>, if present, constrains splitting to
no more than C<limit> strings.  For example, to split a
string into words, use
1552*0Sstevel@tonic-gate
    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'
1557*0Sstevel@tonic-gate
If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters.  If the regexp has
groupings, then the list produced contains the matched substrings from
the groupings as well.  For instance,
1562*0Sstevel@tonic-gate
1563*0Sstevel@tonic-gate    $x = "/usr/bin/perl";
1564*0Sstevel@tonic-gate    @dirs = split m!/!, $x;  # $dirs[0] = ''
1565*0Sstevel@tonic-gate                             # $dirs[1] = 'usr'
1566*0Sstevel@tonic-gate                             # $dirs[2] = 'bin'
1567*0Sstevel@tonic-gate                             # $dirs[3] = 'perl'
1568*0Sstevel@tonic-gate    @parts = split m!(/)!, $x;  # $parts[0] = ''
1569*0Sstevel@tonic-gate                                # $parts[1] = '/'
1570*0Sstevel@tonic-gate                                # $parts[2] = 'usr'
1571*0Sstevel@tonic-gate                                # $parts[3] = '/'
1572*0Sstevel@tonic-gate                                # $parts[4] = 'bin'
1573*0Sstevel@tonic-gate                                # $parts[5] = '/'
1574*0Sstevel@tonic-gate                                # $parts[6] = 'perl'
1575*0Sstevel@tonic-gate
1576*0Sstevel@tonic-gateSince the first character of $x matched the regexp, C<split> prepended
1577*0Sstevel@tonic-gatean empty initial element to the list.
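
The C<limit> argument and the empty regexp described above can be
sketched as follows; the array names are illustrative only:

    my $x = "Calvin and Hobbes";

    my @two   = split /\s+/, $x, 2;   # ('Calvin', 'and Hobbes')
    my @chars = split //, "cat";      # ('c', 'a', 't')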
1578*0Sstevel@tonic-gate
1579*0Sstevel@tonic-gateIf you have read this far, congratulations! You now have all the basic
1580*0Sstevel@tonic-gatetools needed to use regular expressions to solve a wide range of text
1581*0Sstevel@tonic-gateprocessing problems.  If this is your first time through the tutorial,
1582*0Sstevel@tonic-gatewhy not stop here and play around with regexps a while...  S<Part 2>
1583*0Sstevel@tonic-gateconcerns the more esoteric aspects of regular expressions and those
1584*0Sstevel@tonic-gateconcepts certainly aren't needed right at the start.
1585*0Sstevel@tonic-gate
1586*0Sstevel@tonic-gate=head1 Part 2: Power tools
1587*0Sstevel@tonic-gate
1588*0Sstevel@tonic-gateOK, you know the basics of regexps and you want to know more.  If
1589*0Sstevel@tonic-gatematching regular expressions is analogous to a walk in the woods, then
1590*0Sstevel@tonic-gatethe tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time.  Most of the tools in Part 2
1592*0Sstevel@tonic-gateare analogous to flare guns and satellite phones.  They aren't used
1593*0Sstevel@tonic-gatetoo often on a hike, but when we are stuck, they can be invaluable.
1594*0Sstevel@tonic-gate
1595*0Sstevel@tonic-gateWhat follows are the more advanced, less used, or sometimes esoteric
1596*0Sstevel@tonic-gatecapabilities of perl regexps.  In Part 2, we will assume you are
1597*0Sstevel@tonic-gatecomfortable with the basics and concentrate on the new features.
1598*0Sstevel@tonic-gate
1599*0Sstevel@tonic-gate=head2 More on characters, strings, and character classes
1600*0Sstevel@tonic-gate
1601*0Sstevel@tonic-gateThere are a number of escape sequences and character classes that we
1602*0Sstevel@tonic-gatehaven't covered yet.
1603*0Sstevel@tonic-gate
1604*0Sstevel@tonic-gateThere are several escape sequences that convert characters or strings
1605*0Sstevel@tonic-gatebetween upper and lower case.  C<\l> and C<\u> convert the next
1606*0Sstevel@tonic-gatecharacter to lower or upper case, respectively:
1607*0Sstevel@tonic-gate
    $x = "perl";
    $string =~ /\u$x/;  # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'
1612*0Sstevel@tonic-gate
C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
C<\U> and C<\E>, to lower or upper case:
1615*0Sstevel@tonic-gate
    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string
1620*0Sstevel@tonic-gate
1621*0Sstevel@tonic-gateIf there is no C<\E>, case is converted until the end of the
1622*0Sstevel@tonic-gatestring. The regexps C<\L\u$word> or C<\u\L$word> convert the first
1623*0Sstevel@tonic-gatecharacter of C<$word> to uppercase and the rest of the characters to
1624*0Sstevel@tonic-gatelowercase.
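
A minimal sketch of the C<\u\L> combination, using a made-up C<$word>
for illustration:

    my $word = "hOBBES";
    my $name = "\u\L$word";                          # 'Hobbes'
    my $hit  = ("Calvin and Hobbes" =~ /\u\L$word/); # true: pattern is 'Hobbes'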
1625*0Sstevel@tonic-gate
Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>.  The escape sequence
C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters.  For
instance,

    $x = "That !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language
1633*0Sstevel@tonic-gate
1634*0Sstevel@tonic-gateIt does not protect C<$> or C<@>, so that variables can still be
1635*0Sstevel@tonic-gatesubstituted.
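
Because variables are still substituted, C<\Q> is handy for matching
the contents of a variable literally.  A small sketch, with
illustrative names:

    my $literal = "a.c";
    my @strings = ("a.c", "abc");

    my @raw    = grep { /$literal/ }     @strings;  # both: '.' is a wildcard
    my @quoted = grep { /\Q$literal\E/ } @strings;  # only 'a.c'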
1636*0Sstevel@tonic-gate
1637*0Sstevel@tonic-gateWith the advent of 5.6.0, perl regexps can handle more than just the
1638*0Sstevel@tonic-gatestandard ASCII character set.  Perl now supports B<Unicode>, a standard
1639*0Sstevel@tonic-gatefor encoding the character sets from many of the world's written
1640*0Sstevel@tonic-gatelanguages.  Unicode does this by allowing characters to be more than
1641*0Sstevel@tonic-gateone byte wide.  Perl uses the UTF-8 encoding, in which ASCII characters
1642*0Sstevel@tonic-gateare still encoded as one byte, but characters greater than C<chr(127)>
1643*0Sstevel@tonic-gatemay be stored as two or more bytes.
1644*0Sstevel@tonic-gate
1645*0Sstevel@tonic-gateWhat does this mean for regexps? Well, regexp users don't need to know
1646*0Sstevel@tonic-gatemuch about perl's internal representation of strings.  But they do need
1647*0Sstevel@tonic-gateto know 1) how to represent Unicode characters in a regexp and 2) when
1648*0Sstevel@tonic-gatea matching operation will treat the string to be searched as a
1649*0Sstevel@tonic-gatesequence of bytes (the old way) or as a sequence of Unicode characters
1650*0Sstevel@tonic-gate(the new way).  The answer to 1) is that Unicode characters greater
1651*0Sstevel@tonic-gatethan C<chr(127)> may be represented using the C<\x{hex}> notation,
1652*0Sstevel@tonic-gatewith C<hex> a hexadecimal integer:
1653*0Sstevel@tonic-gate
1654*0Sstevel@tonic-gate    /\x{263a}/;  # match a Unicode smiley face :)
1655*0Sstevel@tonic-gate
Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: C<\x{ab}>.  Note that this is different from C<\xab>,
which is just a hexadecimal byte with no Unicode significance.
1659*0Sstevel@tonic-gate
B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features.  This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed.  (The only case where it matters is when your Perl script
itself is written in Unicode and encoded in UTF-8; then an explicit
C<use utf8> is needed.)
1665*0Sstevel@tonic-gate
1666*0Sstevel@tonic-gateFiguring out the hexadecimal sequence of a Unicode character you want
1667*0Sstevel@tonic-gateor deciphering someone else's hexadecimal Unicode regexp is about as
1668*0Sstevel@tonic-gatemuch fun as programming in machine code.  So another way to specify
1669*0Sstevel@tonic-gateUnicode characters is to use the S<B<named character> > escape
1670*0Sstevel@tonic-gatesequence C<\N{name}>.  C<name> is a name for the Unicode character, as
1671*0Sstevel@tonic-gatespecified in the Unicode standard.  For instance, if we wanted to
1672*0Sstevel@tonic-gaterepresent or match the astrological sign for the planet Mercury, we
1673*0Sstevel@tonic-gatecould use
1674*0Sstevel@tonic-gate
1675*0Sstevel@tonic-gate    use charnames ":full"; # use named chars with Unicode full names
1676*0Sstevel@tonic-gate    $x = "abc\N{MERCURY}def";
1677*0Sstevel@tonic-gate    $x =~ /\N{MERCURY}/;   # matches
1678*0Sstevel@tonic-gate
1679*0Sstevel@tonic-gateOne can also use short names or restrict names to a certain alphabet:
1680*0Sstevel@tonic-gate
1681*0Sstevel@tonic-gate    use charnames ':full';
1682*0Sstevel@tonic-gate    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
1683*0Sstevel@tonic-gate
1684*0Sstevel@tonic-gate    use charnames ":short";
1685*0Sstevel@tonic-gate    print "\N{greek:Sigma} is an upper-case sigma.\n";
1686*0Sstevel@tonic-gate
1687*0Sstevel@tonic-gate    use charnames qw(greek);
1688*0Sstevel@tonic-gate    print "\N{sigma} is Greek sigma\n";
1689*0Sstevel@tonic-gate
1690*0Sstevel@tonic-gateA list of full names is found in the file Names.txt in the
1691*0Sstevel@tonic-gatelib/perl5/5.X.X/unicore directory.
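
The two notations name the same characters.  As a small sketch, the
smiley face from the earlier example has the full Unicode name
C<WHITE SMILING FACE>; the variable names below are illustrative:

    use charnames ':full';

    my $by_hex  = "\x{263a}";
    my $by_name = "\N{WHITE SMILING FACE}";

    my $same = ($by_hex eq $by_name);                          # true
    my $hit  = ("abc\x{263a}def" =~ /\N{WHITE SMILING FACE}/); # true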
1692*0Sstevel@tonic-gate
1693*0Sstevel@tonic-gateThe answer to requirement 2), as of 5.6.0, is that if a regexp
1694*0Sstevel@tonic-gatecontains Unicode characters, the string is searched as a sequence of
1695*0Sstevel@tonic-gateUnicode characters.  Otherwise, the string is searched as a sequence of
1696*0Sstevel@tonic-gatebytes.  If the string is being searched as a sequence of Unicode
1697*0Sstevel@tonic-gatecharacters, but matching a single byte is required, we can use the C<\C>
1698*0Sstevel@tonic-gateescape sequence.  C<\C> is a character class akin to C<.> except that
1699*0Sstevel@tonic-gateit matches I<any> byte 0-255.  So
1700*0Sstevel@tonic-gate
    use charnames ":full"; # use named chars with Unicode full names
    $x = "a";
    $x =~ /\C/;  # matches 'a', eats one byte
    $x = "";
    $x =~ /\C/;  # doesn't match, no bytes to match
    $x = "\N{MERCURY}";  # multi-byte Unicode character
    $x =~ /\C/;  # matches, but dangerous!
1708*0Sstevel@tonic-gate
1709*0Sstevel@tonic-gateThe last regexp matches, but is dangerous because the string
1710*0Sstevel@tonic-gateI<character> position is no longer synchronized to the string I<byte>
1711*0Sstevel@tonic-gateposition.  This generates the warning 'Malformed UTF-8
1712*0Sstevel@tonic-gatecharacter'.  The C<\C> is best used for matching the binary data in strings
1713*0Sstevel@tonic-gatewith binary data intermixed with Unicode characters.
1714*0Sstevel@tonic-gate
1715*0Sstevel@tonic-gateLet us now discuss the rest of the character classes.  Just as with
1716*0Sstevel@tonic-gateUnicode characters, there are named Unicode character classes
1717*0Sstevel@tonic-gaterepresented by the C<\p{name}> escape sequence.  Closely associated is
1718*0Sstevel@tonic-gatethe C<\P{name}> character class, which is the negation of the
1719*0Sstevel@tonic-gateC<\p{name}> class.  For example, to match lower and uppercase
1720*0Sstevel@tonic-gatecharacters,
1721*0Sstevel@tonic-gate
1722*0Sstevel@tonic-gate    use charnames ":full"; # use named chars with Unicode full names
1723*0Sstevel@tonic-gate    $x = "BOB";
1724*0Sstevel@tonic-gate    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
1725*0Sstevel@tonic-gate    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
1726*0Sstevel@tonic-gate    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
1727*0Sstevel@tonic-gate    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase
1728*0Sstevel@tonic-gate
1729*0Sstevel@tonic-gateHere is the association between some Perl named classes and the
1730*0Sstevel@tonic-gatetraditional Unicode classes:
1731*0Sstevel@tonic-gate
    Perl class name  Unicode class name or regular expression

    IsAlpha          /^[LM]/
    IsAlnum          /^[LMN]/
    IsASCII          $code <= 127
    IsCntrl          /^C/
    IsBlank          $code =~ /^(0020|0009)$/ || /^Z[^lp]/
    IsDigit          Nd
    IsGraph          /^([LMNPS]|Co)/
    IsLower          Ll
    IsPrint          /^([LMNPS]|Co|Zs)/
    IsPunct          /^P/
    IsSpace          /^Z/ || $code =~ /^(0009|000A|000B|000C|000D)$/
    IsSpacePerl      /^Z/ || $code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/
    IsUpper          /^L[ut]/
    IsWord           /^[LMN]/ || $code eq "005F"
    IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/
1749*0Sstevel@tonic-gate
1750*0Sstevel@tonic-gateYou can also use the official Unicode class names with the C<\p> and
1751*0Sstevel@tonic-gateC<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
1752*0Sstevel@tonic-gateletters, or C<\P{Nd}> for non-digits.  If a C<name> is just one
1753*0Sstevel@tonic-gateletter, the braces can be dropped.  For instance, C<\pM> is the
1754*0Sstevel@tonic-gatecharacter class of Unicode 'marks', for example accent marks.
1755*0Sstevel@tonic-gateFor the full list see L<perlunicode>.
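
A short sketch of the official class names in use, including the
brace-less single-letter form; the variable names are illustrative:

    my $x = "Perl 5";

    my ($word)  = $x =~ /(\pL+)/;     # 'Perl', a run of letters
    my @upper   = $x =~ /\p{Lu}/g;    # ('P')
    my @nondigs = $x =~ /\P{Nd}/g;    # 'P', 'e', 'r', 'l', ' '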
1756*0Sstevel@tonic-gate
Unicode has also been separated into various sets of characters
which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
For the full list see L<perlunicode>.
1761*0Sstevel@tonic-gate
1762*0Sstevel@tonic-gateC<\X> is an abbreviation for a character class sequence that includes
1763*0Sstevel@tonic-gatethe Unicode 'combining character sequences'.  A 'combining character
1764*0Sstevel@tonic-gatesequence' is a base character followed by any number of combining
1765*0Sstevel@tonic-gatecharacters.  An example of a combining character is an accent.   Using
1766*0Sstevel@tonic-gatethe Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
1767*0Sstevel@tonic-gatecharacter sequence with base character C<A> and combining character
1768*0Sstevel@tonic-gateS<C<COMBINING RING> >, which translates in Danish to A with the circle
atop it, as in the word Angstrom.  C<\X> is equivalent to C<\PM\pM*>,
i.e., a non-mark followed by zero or more marks.
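
To see the difference between C<.> and C<\X>, here is a minimal
sketch.  It assumes the combining ring is code point C<\x{030A}>
(COMBINING RING ABOVE); the variable names are illustrative:

    my $x = "A\x{030A}ngstrom";    # 'A', combining ring, then 'ngstrom'

    my @chars     = $x =~ /./g;    # 9 elements: one per code point
    my @sequences = $x =~ /\X/g;   # 8 elements: 'A' and its ring stay together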
1771*0Sstevel@tonic-gate
1772*0Sstevel@tonic-gateFor the full and latest information about Unicode see the latest
1773*0Sstevel@tonic-gateUnicode standard, or the Unicode Consortium's website http://www.unicode.org/
1774*0Sstevel@tonic-gate
1775*0Sstevel@tonic-gateAs if all those classes weren't enough, Perl also defines POSIX style
1776*0Sstevel@tonic-gatecharacter classes.  These have the form C<[:name:]>, with C<name> the
1777*0Sstevel@tonic-gatename of the POSIX class.  The POSIX classes are C<alpha>, C<alnum>,
1778*0Sstevel@tonic-gateC<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
1779*0Sstevel@tonic-gateC<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
1780*0Sstevel@tonic-gateextension to match C<\w>), and C<blank> (a GNU extension).  If C<utf8>
1781*0Sstevel@tonic-gateis being used, then these classes are defined the same as their
1782*0Sstevel@tonic-gatecorresponding perl Unicode classes: C<[:upper:]> is the same as
1783*0Sstevel@tonic-gateC<\p{IsUpper}>, etc.  The POSIX character classes, however, don't
1784*0Sstevel@tonic-gaterequire using C<utf8>.  The C<[:digit:]>, C<[:word:]>, and
1785*0Sstevel@tonic-gateC<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
1786*0Sstevel@tonic-gatecharacter classes.  To negate a POSIX class, put a C<^> in front of
1787*0Sstevel@tonic-gatethe name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
1788*0Sstevel@tonic-gateC<utf8>, C<\P{IsDigit}>.  The Unicode and POSIX character classes can
1789*0Sstevel@tonic-gatebe used just like C<\d>, with the exception that POSIX character
1790*0Sstevel@tonic-gateclasses can only be used inside of a character class:
1791*0Sstevel@tonic-gate
1792*0Sstevel@tonic-gate    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
1793*0Sstevel@tonic-gate    /^=item\s[[:digit:]]/;      # match '=item',
1794*0Sstevel@tonic-gate                                # followed by a space and a digit
1795*0Sstevel@tonic-gate    use charnames ":full";
1796*0Sstevel@tonic-gate    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
1797*0Sstevel@tonic-gate    /^=item\s\p{IsDigit}/;        # match '=item',
1798*0Sstevel@tonic-gate                                  # followed by a space and a digit
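
Negated and combined POSIX classes work the same way.  A small
sketch, with illustrative variable names:

    my $x = "item 42b";

    my ($digits)  = $x =~ /([[:digit:]]+)/;            # '42'
    my ($nondigs) = $x =~ /([[:^digit:]]+)/;           # 'item ' (with the space)
    my ($alnum)   = $x =~ /([[:alpha:][:digit:]]+)/;   # 'item'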
1799*0Sstevel@tonic-gate
1800*0Sstevel@tonic-gateWhew! That is all the rest of the characters and character classes.
1801*0Sstevel@tonic-gate
1802*0Sstevel@tonic-gate=head2 Compiling and saving regular expressions
1803*0Sstevel@tonic-gate
1804*0Sstevel@tonic-gateIn Part 1 we discussed the C<//o> modifier, which compiles a regexp
1805*0Sstevel@tonic-gatejust once.  This suggests that a compiled regexp is some data structure
1806*0Sstevel@tonic-gatethat can be stored once and used again and again.  The regexp quote
1807*0Sstevel@tonic-gateC<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
1808*0Sstevel@tonic-gateregexp and transforms the result into a form that can be assigned to a
1809*0Sstevel@tonic-gatevariable:
1810*0Sstevel@tonic-gate
1811*0Sstevel@tonic-gate    $reg = qr/foo+bar?/;  # reg contains a compiled regexp
1812*0Sstevel@tonic-gate
1813*0Sstevel@tonic-gateThen C<$reg> can be used as a regexp:
1814*0Sstevel@tonic-gate
1815*0Sstevel@tonic-gate    $x = "fooooba";
1816*0Sstevel@tonic-gate    $x =~ $reg;     # matches, just like /foo+bar?/
1817*0Sstevel@tonic-gate    $x =~ /$reg/;   # same thing, alternate form
1818*0Sstevel@tonic-gate
1819*0Sstevel@tonic-gateC<$reg> can also be interpolated into a larger regexp:
1820*0Sstevel@tonic-gate
1821*0Sstevel@tonic-gate    $x =~ /(abc)?$reg/;  # still matches
1822*0Sstevel@tonic-gate
1823*0Sstevel@tonic-gateAs with the matching operator, the regexp quote can use different
1824*0Sstevel@tonic-gatedelimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>.  The single quote
1825*0Sstevel@tonic-gatedelimiters C<qr''> prevent any interpolation from taking place.
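
Compiled regexps can also serve as building blocks for larger ones.
A minimal sketch, with names chosen only for this example:

    my $sign   = qr/[+-]?/;
    my $digits = qr/\d+/;
    my $int    = qr/^$sign$digits$/;   # interpolate the pieces into a new regexp

    my @ok = grep { $_ =~ $int } ("42", "-7", "4.2", "+x");   # ('42', '-7')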
1826*0Sstevel@tonic-gate
Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered.  Using
pre-compiled regexps, the C<simple_grep> program can be expanded into a
program that matches multiple patterns:
1831*0Sstevel@tonic-gate
1832*0Sstevel@tonic-gate    % cat > multi_grep
1833*0Sstevel@tonic-gate    #!/usr/bin/perl
1834*0Sstevel@tonic-gate    # multi_grep - match any of <number> regexps
1835*0Sstevel@tonic-gate    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
1836*0Sstevel@tonic-gate
1837*0Sstevel@tonic-gate    $number = shift;
1838*0Sstevel@tonic-gate    $regexp[$_] = shift foreach (0..$number-1);
1839*0Sstevel@tonic-gate    @compiled = map qr/$_/, @regexp;
1840*0Sstevel@tonic-gate    while ($line = <>) {
1841*0Sstevel@tonic-gate        foreach $pattern (@compiled) {
1842*0Sstevel@tonic-gate            if ($line =~ /$pattern/) {
1843*0Sstevel@tonic-gate                print $line;
1844*0Sstevel@tonic-gate                last;  # we matched, so move onto the next line
1845*0Sstevel@tonic-gate            }
1846*0Sstevel@tonic-gate        }
1847*0Sstevel@tonic-gate    }
1848*0Sstevel@tonic-gate    ^D
1849*0Sstevel@tonic-gate
1850*0Sstevel@tonic-gate    % multi_grep 2 last for multi_grep
1851*0Sstevel@tonic-gate        $regexp[$_] = shift foreach (0..$number-1);
1852*0Sstevel@tonic-gate            foreach $pattern (@compiled) {
1853*0Sstevel@tonic-gate                    last;
1854*0Sstevel@tonic-gate
1855*0Sstevel@tonic-gateStoring pre-compiled regexps in an array C<@compiled> allows us to
1856*0Sstevel@tonic-gatesimply loop through the regexps without any recompilation, thus gaining
1857*0Sstevel@tonic-gateflexibility without sacrificing speed.
1858*0Sstevel@tonic-gate
1859*0Sstevel@tonic-gate=head2 Embedding comments and modifiers in a regular expression
1860*0Sstevel@tonic-gate
1861*0Sstevel@tonic-gateStarting with this section, we will be discussing Perl's set of
1862*0Sstevel@tonic-gateB<extended patterns>.  These are extensions to the traditional regular
1863*0Sstevel@tonic-gateexpression syntax that provide powerful new tools for pattern
1864*0Sstevel@tonic-gatematching.  We have already seen extensions in the form of the minimal
1865*0Sstevel@tonic-gatematching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>.  The
1866*0Sstevel@tonic-gaterest of the extensions below have the form C<(?char...)>, where the
1867*0Sstevel@tonic-gateC<char> is a character that determines the type of extension.
1868*0Sstevel@tonic-gate
1869*0Sstevel@tonic-gateThe first extension is an embedded comment C<(?#text)>.  This embeds a
1870*0Sstevel@tonic-gatecomment into the regular expression without affecting its meaning.  The
1871*0Sstevel@tonic-gatecomment should not have any closing parentheses in the text.  An
1872*0Sstevel@tonic-gateexample is
1873*0Sstevel@tonic-gate
1874*0Sstevel@tonic-gate    /(?# Match an integer:)[+-]?\d+/;
1875*0Sstevel@tonic-gate
1876*0Sstevel@tonic-gateThis style of commenting has been largely superseded by the raw,
1877*0Sstevel@tonic-gatefreeform commenting that is allowed with the C<//x> modifier.
1878*0Sstevel@tonic-gate
The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>.  For instance,
1881*0Sstevel@tonic-gate
1882*0Sstevel@tonic-gate    /(?i)yes/;  # match 'yes' case insensitively
1883*0Sstevel@tonic-gate    /yes/i;     # same thing
1884*0Sstevel@tonic-gate    /(?x)(          # freeform version of an integer regexp
1885*0Sstevel@tonic-gate             [+-]?  # match an optional sign
1886*0Sstevel@tonic-gate             \d+    # match a sequence of digits
1887*0Sstevel@tonic-gate         )
1888*0Sstevel@tonic-gate    /x;
1889*0Sstevel@tonic-gate
Embedded modifiers can have two important advantages over the usual
modifiers.  The first is that embedded modifiers allow a custom set of
modifiers for I<each> regexp pattern.  This is great for matching an
array of regexps that must have different modifiers:
1894*0Sstevel@tonic-gate
1895*0Sstevel@tonic-gate    $pattern[0] = '(?i)doctor';
1896*0Sstevel@tonic-gate    $pattern[1] = 'Johnson';
1897*0Sstevel@tonic-gate    ...
1898*0Sstevel@tonic-gate    while (<>) {
1899*0Sstevel@tonic-gate        foreach $patt (@pattern) {
1900*0Sstevel@tonic-gate            print if /$patt/;
1901*0Sstevel@tonic-gate        }
1902*0Sstevel@tonic-gate    }
1903*0Sstevel@tonic-gate
1904*0Sstevel@tonic-gateThe second advantage is that embedded modifiers only affect the regexp
1905*0Sstevel@tonic-gateinside the group the embedded modifier is contained in.  So grouping
1906*0Sstevel@tonic-gatecan be used to localize the modifier's effects:
1907*0Sstevel@tonic-gate
1908*0Sstevel@tonic-gate    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.
1909*0Sstevel@tonic-gate
1910*0Sstevel@tonic-gateEmbedded modifiers can also turn off any modifiers already present
1911*0Sstevel@tonic-gateby using, e.g., C<(?-i)>.  Modifiers can also be combined into
1912*0Sstevel@tonic-gatea single expression, e.g., C<(?s-i)> turns on single line mode and
1913*0Sstevel@tonic-gateturns off case insensitivity.
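
The scoping and switching-off behavior can be sketched as follows;
the test strings are made up for illustration:

    my @answers = ("Answer: YES", "ANSWER: yes", "Answer: Yes");

    # only 'yes' is case insensitive; 'Answer: ' must match exactly
    my @scoped = grep { /Answer: ((?i)yes)/ } @answers;
    # @scoped is ('Answer: YES', 'Answer: Yes')

    # 'answer: ' is case insensitive; (?-i) makes 'yes' case sensitive again
    my @switched = grep { /(?i)answer: (?-i)yes/ } @answers;
    # @switched is ('ANSWER: yes')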
1914*0Sstevel@tonic-gate
1915*0Sstevel@tonic-gate=head2 Non-capturing groupings
1916*0Sstevel@tonic-gate
1917*0Sstevel@tonic-gateWe noted in Part 1 that groupings C<()> had two distinct functions: 1)
1918*0Sstevel@tonic-gategroup regexp elements together as a single unit, and 2) extract, or
1919*0Sstevel@tonic-gatecapture, substrings that matched the regexp in the
1920*0Sstevel@tonic-gategrouping.  Non-capturing groupings, denoted by C<(?:regexp)>, allow the
1921*0Sstevel@tonic-gateregexp to be treated as a single unit, but don't extract substrings or
1922*0Sstevel@tonic-gateset matching variables C<$1>, etc.  Both capturing and non-capturing
1923*0Sstevel@tonic-gategroupings are allowed to co-exist in the same regexp.  Because there is
1924*0Sstevel@tonic-gateno extraction, non-capturing groupings are faster than capturing
1925*0Sstevel@tonic-gategroupings.  Non-capturing groupings are also handy for choosing exactly
1926*0Sstevel@tonic-gatewhich parts of a regexp are to be extracted to matching variables:
1927*0Sstevel@tonic-gate
    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
1936*0Sstevel@tonic-gate
1937*0Sstevel@tonic-gateNon-capturing groupings are also useful for removing nuisance
1938*0Sstevel@tonic-gateelements gathered from a split operation:
1939*0Sstevel@tonic-gate
1940*0Sstevel@tonic-gate    $x = '12a34b5';
1941*0Sstevel@tonic-gate    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
1942*0Sstevel@tonic-gate    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')
1943*0Sstevel@tonic-gate
1944*0Sstevel@tonic-gateNon-capturing groupings may also have embedded modifiers:
1945*0Sstevel@tonic-gateC<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
1946*0Sstevel@tonic-gatecase insensitively and turns off multi-line mode.
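
For example, in this small sketch (the string is made up for
illustration) only the title is matched case insensitively, and the
C<(?i:...)> grouping does not create a C<$2>:

    my $x = "Dr. JOHNSON and dr. Johnson";

    my @docs = $x =~ /((?i:dr)\. Johnson)/g;   # ('dr. Johnson')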
1947*0Sstevel@tonic-gate
1948*0Sstevel@tonic-gate=head2 Looking ahead and looking behind
1949*0Sstevel@tonic-gate
1950*0Sstevel@tonic-gateThis section concerns the lookahead and lookbehind assertions.  First,
1951*0Sstevel@tonic-gatea little background.
1952*0Sstevel@tonic-gate
1953*0Sstevel@tonic-gateIn Perl regular expressions, most regexp elements 'eat up' a certain
1954*0Sstevel@tonic-gateamount of string when they match.  For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
1956*0Sstevel@tonic-gatesense that perl moves to the next character position in the string
1957*0Sstevel@tonic-gateafter the match.  There are some elements, however, that don't eat up
1958*0Sstevel@tonic-gatecharacters (advance the character position) if they match.  The examples
1959*0Sstevel@tonic-gatewe have seen so far are the anchors.  The anchor C<^> matches the
1960*0Sstevel@tonic-gatebeginning of the line, but doesn't eat any characters.  Similarly, the
1961*0Sstevel@tonic-gateword boundary anchor C<\b> matches, e.g., if the character to the left
1962*0Sstevel@tonic-gateis a word character and the character to the right is a non-word
1963*0Sstevel@tonic-gatecharacter, but it doesn't eat up any characters itself.  Anchors are
1964*0Sstevel@tonic-gateexamples of 'zero-width assertions'.  Zero-width, because they consume
1965*0Sstevel@tonic-gateno characters, and assertions, because they test some property of the
1966*0Sstevel@tonic-gatestring.  In the context of our walk in the woods analogy to regexp
1967*0Sstevel@tonic-gatematching, most regexp elements move us along a trail, but anchors have
1968*0Sstevel@tonic-gateus stop a moment and check our surroundings.  If the local environment
1969*0Sstevel@tonic-gatechecks out, we can proceed forward.  But if the local environment
1970*0Sstevel@tonic-gatedoesn't satisfy us, we must backtrack.
1971*0Sstevel@tonic-gate
1972*0Sstevel@tonic-gateChecking the environment entails either looking ahead on the trail,
1973*0Sstevel@tonic-gatelooking behind, or both.  C<^> looks behind, to see that there are no
1974*0Sstevel@tonic-gatecharacters before.  C<$> looks ahead, to see that there are no
1975*0Sstevel@tonic-gatecharacters after.  C<\b> looks both ahead and behind, to see if the
1976*0Sstevel@tonic-gatecharacters on either side differ in their 'word'-ness.
1977*0Sstevel@tonic-gate
1978*0Sstevel@tonic-gateThe lookahead and lookbehind assertions are generalizations of the
1979*0Sstevel@tonic-gateanchor concept.  Lookahead and lookbehind are zero-width assertions
1980*0Sstevel@tonic-gatethat let us specify which characters we want to test for.  The
1981*0Sstevel@tonic-gatelookahead assertion is denoted by C<(?=regexp)> and the lookbehind
1982*0Sstevel@tonic-gateassertion is denoted by C<< (?<=fixed-regexp) >>.  Some examples are
1983*0Sstevel@tonic-gate
1984*0Sstevel@tonic-gate    $x = "I catch the housecat 'Tom-cat' with catnip";
1985*0Sstevel@tonic-gate    $x =~ /cat(?=\s+)/;  # matches 'cat' in 'housecat'
1986*0Sstevel@tonic-gate    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
1987*0Sstevel@tonic-gate                                           # $catwords[0] = 'catch'
1988*0Sstevel@tonic-gate                                           # $catwords[1] = 'catnip'
1989*0Sstevel@tonic-gate    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
1990*0Sstevel@tonic-gate    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
1991*0Sstevel@tonic-gate                              # middle of $x
1992*0Sstevel@tonic-gate
1993*0Sstevel@tonic-gateNote that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
1994*0Sstevel@tonic-gatenon-capturing, since these are zero-width assertions.  Thus in the
1995*0Sstevel@tonic-gatesecond regexp, the substrings captured are those of the whole regexp
1996*0Sstevel@tonic-gateitself.  Lookahead C<(?=regexp)> can match arbitrary regexps, but
1997*0Sstevel@tonic-gatelookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
1998*0Sstevel@tonic-gatewidth, i.e., a fixed number of characters long.  Thus
1999*0Sstevel@tonic-gateC<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not.  The
2000*0Sstevel@tonic-gatenegated versions of the lookahead and lookbehind assertions are
2001*0Sstevel@tonic-gatedenoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
2002*0Sstevel@tonic-gateThey evaluate true if the regexps do I<not> match:
2003*0Sstevel@tonic-gate
2004*0Sstevel@tonic-gate    $x = "foobar";
2005*0Sstevel@tonic-gate    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
2006*0Sstevel@tonic-gate    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
2007*0Sstevel@tonic-gate    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
2008*0Sstevel@tonic-gate
2009*0Sstevel@tonic-gateThe C<\C> is unsupported in lookbehind, because the already
2010*0Sstevel@tonic-gatetreacherous definition of C<\C> would become even more so
2011*0Sstevel@tonic-gatewhen going backwards.
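
As a further sketch of these assertions, the following picks numbers
out of a string according to what surrounds them, without including
the surroundings in the match.  The string and variable names are
illustrative only:

    my $x = 'cost: $15, weight: 20kg, discount: $3';

    my @dollars = $x =~ /(?<=\$)\d+/g;       # ('15', '3'): preceded by '$'
    my @weights = $x =~ /\d+(?=kg)/g;        # ('20'): followed by 'kg'
    my @plain   = $x =~ /(?<![\$\d])\d+/g;   # ('20'): not preceded by '$' or a digit

The C<\d> inside the negative lookbehind keeps the regexp from
matching the C<5> in C<$15> after the match at C<1> has been rejected.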
2012*0Sstevel@tonic-gate
2013*0Sstevel@tonic-gate=head2 Using independent subexpressions to prevent backtracking
2014*0Sstevel@tonic-gate
2015*0Sstevel@tonic-gateThe last few extended patterns in this tutorial are experimental as of
2016*0Sstevel@tonic-gate5.6.0.  Play with them, use them in some code, but don't rely on them
2017*0Sstevel@tonic-gatejust yet for production code.
2018*0Sstevel@tonic-gate
2019*0Sstevel@tonic-gateS<B<Independent subexpressions> > are regular expressions, in the
2020*0Sstevel@tonic-gatecontext of a larger regular expression, that function independently of
2021*0Sstevel@tonic-gatethe larger regular expression.  That is, they consume as much or as
2022*0Sstevel@tonic-gatelittle of the string as they wish without regard for the ability of
2023*0Sstevel@tonic-gatethe larger regexp to match.  Independent subexpressions are represented
2024*0Sstevel@tonic-gateby C<< (?>regexp) >>.  We can illustrate their behavior by first
2025*0Sstevel@tonic-gateconsidering an ordinary regexp:
2026*0Sstevel@tonic-gate
2027*0Sstevel@tonic-gate    $x = "ab";
2028*0Sstevel@tonic-gate    $x =~ /a*ab/;  # matches
2029*0Sstevel@tonic-gate
2030*0Sstevel@tonic-gateThis obviously matches, but in the process of matching, the
2031*0Sstevel@tonic-gatesubexpression C<a*> first grabbed the C<a>.  Doing so, however,
2032*0Sstevel@tonic-gatewouldn't allow the whole regexp to match, so after backtracking, C<a*>
2033*0Sstevel@tonic-gateeventually gave back the C<a> and matched the empty string.  Here, what
2034*0Sstevel@tonic-gateC<a*> matched was I<dependent> on what the rest of the regexp matched.
2035*0Sstevel@tonic-gate
2036*0Sstevel@tonic-gateContrast that with an independent subexpression:
2037*0Sstevel@tonic-gate
2038*0Sstevel@tonic-gate    $x =~ /(?>a*)ab/;  # doesn't match!
2039*0Sstevel@tonic-gate
2040*0Sstevel@tonic-gateThe independent subexpression C<< (?>a*) >> doesn't care about the rest
2041*0Sstevel@tonic-gateof the regexp, so it sees an C<a> and grabs it.  Then the rest of the
2042*0Sstevel@tonic-gateregexp C<ab> cannot match.  Because C<< (?>a*) >> is independent, there
2043*0Sstevel@tonic-gateis no backtracking and the independent subexpression does not give
2044*0Sstevel@tonic-gateup its C<a>.  Thus the match of the regexp as a whole fails.  A similar
2045*0Sstevel@tonic-gatebehavior occurs with completely independent regexps:
2046*0Sstevel@tonic-gate
2047*0Sstevel@tonic-gate    $x = "ab";
2048*0Sstevel@tonic-gate    $x =~ /a*/g;   # matches, eats an 'a'
2049*0Sstevel@tonic-gate    $x =~ /\Gab/g; # doesn't match, no 'a' available
2050*0Sstevel@tonic-gate
2051*0Sstevel@tonic-gateHere C<//g> and C<\G> create a 'tag team' handoff of the string from
2052*0Sstevel@tonic-gateone regexp to the other.  Regexps with an independent subexpression are
2053*0Sstevel@tonic-gatemuch like this, with a handoff of the string to the independent
2054*0Sstevel@tonic-gatesubexpression, and a handoff of the string back to the enclosing
2055*0Sstevel@tonic-gateregexp.
2056*0Sstevel@tonic-gate
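The handoff can be made visible with C<pos>.  This is a minimal
sketch; the variable names C<$first>, C<$second> and the two
C<$pos_after_...> scalars are assumptions introduced for illustration.

```perl
use strict;
use warnings;

my $x = "ab";

# First regexp: a* consumes the 'a' and leaves pos($x) just after it.
my $first = $x =~ /a*/g;
my $pos_after_first = pos($x);      # 1

# Second regexp: \G anchors at pos($x), where only 'b' remains.
my $second = $x =~ /\Gab/g;
my $pos_after_second = pos($x);     # undef: a failed //g match resets pos

printf "first=%d pos=%d second=%d\n",
    $first ? 1 : 0, $pos_after_first, $second ? 1 : 0;
```

The second regexp cannot reach back before the position handed to it,
which is the same restriction an independent subexpression places on
the regexp that encloses it.
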
2057*0Sstevel@tonic-gateThe ability of an independent subexpression to prevent backtracking
2058*0Sstevel@tonic-gatecan be quite useful.  Suppose we want to match a non-empty string
2059*0Sstevel@tonic-gateenclosed in parentheses up to two levels deep.  Then the following
2060*0Sstevel@tonic-gateregexp matches:
2061*0Sstevel@tonic-gate
2062*0Sstevel@tonic-gate    $x = "abc(de(fg)h";  # unbalanced parentheses
2063*0Sstevel@tonic-gate    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
2064*0Sstevel@tonic-gate
2065*0Sstevel@tonic-gateThe regexp matches an open parenthesis, one or more copies of an
2066*0Sstevel@tonic-gatealternation, and a close parenthesis.  The alternation is two-way, with
2067*0Sstevel@tonic-gatethe first alternative C<[^()]+> matching a substring with no
2068*0Sstevel@tonic-gateparentheses and the second alternative C<\([^()]*\)>  matching a
2069*0Sstevel@tonic-gatesubstring delimited by parentheses.  The problem with this regexp is
2070*0Sstevel@tonic-gatethat it is pathological: it has nested indeterminate quantifiers
2071*0Sstevel@tonic-gateof the form C<(a+|b)+>.  We discussed in Part 1 how nested quantifiers
2072*0Sstevel@tonic-gatelike this could take an exponentially long time to execute if there
2073*0Sstevel@tonic-gatewas no match possible.  To prevent the exponential blowup, we need to
2074*0Sstevel@tonic-gateprevent useless backtracking at some point.  This can be done by
2075*0Sstevel@tonic-gateenclosing the inner quantifier as an independent subexpression:
2076*0Sstevel@tonic-gate
2077*0Sstevel@tonic-gate    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
2078*0Sstevel@tonic-gate
Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it.  Failed
matches are then detected much more quickly.
2082*0Sstevel@tonic-gate
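The sketch below runs both forms of the regexp side by side to check
that they agree on what they match.  The extra sample strings and the
names C<$plain>, C<$atomic> and C<@results> are assumptions chosen for
illustration; no timing claims are made here.

```perl
use strict;
use warnings;

# The two regexps from the text: ordinary, and with an independent subexpression.
my $plain  = qr/\( ( [^()]+     | \([^()]*\) )+ \)/x;
my $atomic = qr/\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

my @results;
for my $str ("abc(de(fg)h", "x(a(b)c)y", "no parens") {
    my $m_plain  = $str =~ $plain  ? $& : undef;
    my $m_atomic = $str =~ $atomic ? $& : undef;
    push @results, [$str, $m_plain, $m_atomic];
    printf "%-12s plain=%s atomic=%s\n", $str,
        defined $m_plain  ? $m_plain  : '(none)',
        defined $m_atomic ? $m_atomic : '(none)';
}
```

For the unbalanced string from the text, both forms settle on the
inner C<(fg)>; the difference between them lies in how much
backtracking is attempted along the way.
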
2083*0Sstevel@tonic-gate=head2 Conditional expressions
2084*0Sstevel@tonic-gate
2085*0Sstevel@tonic-gateA S<B<conditional expression> > is a form of if-then-else statement
2086*0Sstevel@tonic-gatethat allows one to choose which patterns are to be matched, based on
2087*0Sstevel@tonic-gatesome condition.  There are two types of conditional expression:
2088*0Sstevel@tonic-gateC<(?(condition)yes-regexp)> and
2089*0Sstevel@tonic-gateC<(?(condition)yes-regexp|no-regexp)>.  C<(?(condition)yes-regexp)> is
2090*0Sstevel@tonic-gatelike an S<C<'if () {}'> > statement in Perl.  If the C<condition> is true,
2091*0Sstevel@tonic-gatethe C<yes-regexp> will be matched.  If the C<condition> is false, the
C<yes-regexp> will be skipped and perl will move on to the next regexp
2093*0Sstevel@tonic-gateelement.  The second form is like an S<C<'if () {} else {}'> > statement
2094*0Sstevel@tonic-gatein Perl.  If the C<condition> is true, the C<yes-regexp> will be
2095*0Sstevel@tonic-gatematched, otherwise the C<no-regexp> will be matched.
2096*0Sstevel@tonic-gate
2097*0Sstevel@tonic-gateThe C<condition> can have two forms.  The first form is simply an
2098*0Sstevel@tonic-gateinteger in parentheses C<(integer)>.  It is true if the corresponding
2099*0Sstevel@tonic-gatebackreference C<\integer> matched earlier in the regexp.  The second
form is a bare zero-width assertion C<(?...)>, either a
2101*0Sstevel@tonic-gatelookahead, a lookbehind, or a code assertion (discussed in the next
2102*0Sstevel@tonic-gatesection).
2103*0Sstevel@tonic-gate
2104*0Sstevel@tonic-gateThe integer form of the C<condition> allows us to choose, with more
2105*0Sstevel@tonic-gateflexibility, what to match based on what matched earlier in the
2106*0Sstevel@tonic-gateregexp. This searches for words of the form C<"$x$x"> or
2107*0Sstevel@tonic-gateC<"$x$y$y$x">:
2108*0Sstevel@tonic-gate
2109*0Sstevel@tonic-gate    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
2110*0Sstevel@tonic-gate    beriberi
2111*0Sstevel@tonic-gate    coco
2112*0Sstevel@tonic-gate    couscous
2113*0Sstevel@tonic-gate    deed
2114*0Sstevel@tonic-gate    ...
2115*0Sstevel@tonic-gate    toot
2116*0Sstevel@tonic-gate    toto
2117*0Sstevel@tonic-gate    tutu
2118*0Sstevel@tonic-gate
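The same search can be sketched in plain Perl without C<simple_grep>.
The short word list below is an assumption standing in for
F</usr/dict/words>, and C<@hits> is a name introduced for illustration.

```perl
use strict;
use warnings;

# Regexp from the text: words of the form "$x$x" or "$x$y$y$x".
my $re = qr/^(\w+)(\w+)?(?(2)\2\1|\1)$/;

# Hypothetical word list standing in for the dictionary file.
my @words = qw(beriberi coco deed hello banana toot);
my @hits  = grep { $_ =~ $re } @words;

print "@hits\n";   # beriberi coco deed toot
```

When group 2 takes part in the match, the condition C<(2)> is true and
C<\2\1> must follow; otherwise the plain repetition C<\1> is required.
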
2119*0Sstevel@tonic-gateThe lookbehind C<condition> allows, along with backreferences,
2120*0Sstevel@tonic-gatean earlier part of the match to influence a later part of the
2121*0Sstevel@tonic-gatematch.  For instance,
2122*0Sstevel@tonic-gate
2123*0Sstevel@tonic-gate    /[ATGC]+(?(?<=AA)G|C)$/;
2124*0Sstevel@tonic-gate
2125*0Sstevel@tonic-gatematches a DNA sequence such that it either ends in C<AAG>, or some
2126*0Sstevel@tonic-gateother base pair combination and C<C>.  Note that the form is
2127*0Sstevel@tonic-gateC<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
2128*0Sstevel@tonic-gatelookahead, lookbehind or code assertions, the parentheses around the
2129*0Sstevel@tonic-gateconditional are not needed.
2130*0Sstevel@tonic-gate
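A short sketch of the lookbehind condition in action follows.  The
four sample sequences and the hash C<%matched> are assumptions made
for illustration.

```perl
use strict;
use warnings;

# Regexp from the text: after 'AA' the final base must be G, otherwise C.
my $re = qr/[ATGC]+(?(?<=AA)G|C)$/;

my %matched;
for my $seq (qw(GATTAAG GATTAC GATTAAC GATTAG)) {
    $matched{$seq} = $seq =~ $re ? 1 : 0;
    print "$seq: ", ($matched{$seq} ? "match" : "no match"), "\n";
}
```

C<GATTAAG> and C<GATTAC> match.  C<GATTAAC> does not, because the
lookbehind sees C<AA> just before the last base and therefore demands
C<G>; C<GATTAG> does not, because without a preceding C<AA> the last
base must be C<C>.
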
2131*0Sstevel@tonic-gate=head2 A bit of magic: executing Perl code in a regular expression
2132*0Sstevel@tonic-gate
2133*0Sstevel@tonic-gateNormally, regexps are a part of Perl expressions.
2134*0Sstevel@tonic-gateS<B<Code evaluation> > expressions turn that around by allowing
2135*0Sstevel@tonic-gatearbitrary Perl code to be a part of a regexp.  A code evaluation
2136*0Sstevel@tonic-gateexpression is denoted C<(?{code})>, with C<code> a string of Perl
2137*0Sstevel@tonic-gatestatements.
2138*0Sstevel@tonic-gate
2139*0Sstevel@tonic-gateCode expressions are zero-width assertions, and the value they return
2140*0Sstevel@tonic-gatedepends on their environment.  There are two possibilities: either the
2141*0Sstevel@tonic-gatecode expression is used as a conditional in a conditional expression
2142*0Sstevel@tonic-gateC<(?(condition)...)>, or it is not.  If the code expression is a
2143*0Sstevel@tonic-gateconditional, the code is evaluated and the result (i.e., the result of
2144*0Sstevel@tonic-gatethe last statement) is used to determine truth or falsehood.  If the
2145*0Sstevel@tonic-gatecode expression is not used as a conditional, the assertion always
2146*0Sstevel@tonic-gateevaluates true and the result is put into the special variable
2147*0Sstevel@tonic-gateC<$^R>.  The variable C<$^R> can then be used in code expressions later
2148*0Sstevel@tonic-gatein the regexp.  Here are some silly examples:
2149*0Sstevel@tonic-gate
2150*0Sstevel@tonic-gate    $x = "abcdef";
2151*0Sstevel@tonic-gate    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
2152*0Sstevel@tonic-gate                                         # prints 'Hi Mom!'
2153*0Sstevel@tonic-gate    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
2154*0Sstevel@tonic-gate                                         # no 'Hi Mom!'
2155*0Sstevel@tonic-gate
2156*0Sstevel@tonic-gatePay careful attention to the next example:
2157*0Sstevel@tonic-gate
2158*0Sstevel@tonic-gate    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
2159*0Sstevel@tonic-gate                                         # no 'Hi Mom!'
2160*0Sstevel@tonic-gate                                         # but why not?
2161*0Sstevel@tonic-gate
2162*0Sstevel@tonic-gateAt first glance, you'd think that it shouldn't print, because obviously
2163*0Sstevel@tonic-gatethe C<ddd> isn't going to match the target string. But look at this
2164*0Sstevel@tonic-gateexample:
2165*0Sstevel@tonic-gate
2166*0Sstevel@tonic-gate    $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
2167*0Sstevel@tonic-gate                                           # but _does_ print
2168*0Sstevel@tonic-gate
2169*0Sstevel@tonic-gateHmm. What happened here? If you've been following along, you know that
2170*0Sstevel@tonic-gatethe above pattern should be effectively the same as the last one --
2171*0Sstevel@tonic-gateenclosing the d in a character class isn't going to change what it
2172*0Sstevel@tonic-gatematches. So why does the first not print while the second one does?
2173*0Sstevel@tonic-gate
2174*0Sstevel@tonic-gateThe answer lies in the optimizations the REx engine makes. In the first
2175*0Sstevel@tonic-gatecase, all the engine sees are plain old characters (aside from the
2176*0Sstevel@tonic-gateC<?{}> construct). It's smart enough to realize that the string 'ddd'
2177*0Sstevel@tonic-gatedoesn't occur in our target string before actually running the pattern
2178*0Sstevel@tonic-gatethrough. But in the second case, we've tricked it into thinking that our
2179*0Sstevel@tonic-gatepattern is more complicated than it is. It takes a look, sees our
2180*0Sstevel@tonic-gatecharacter class, and decides that it will have to actually run the
2181*0Sstevel@tonic-gatepattern to determine whether or not it matches, and in the process of
2182*0Sstevel@tonic-gaterunning it hits the print statement before it discovers that we don't
2183*0Sstevel@tonic-gatehave a match.
2184*0Sstevel@tonic-gate
2185*0Sstevel@tonic-gateTo take a closer look at how the engine does optimizations, see the
2186*0Sstevel@tonic-gatesection L<"Pragmas and debugging"> below.
2187*0Sstevel@tonic-gate
2188*0Sstevel@tonic-gateMore fun with C<?{}>:
2189*0Sstevel@tonic-gate
2190*0Sstevel@tonic-gate    $x =~ /(?{print "Hi Mom!";})/;       # matches,
2191*0Sstevel@tonic-gate                                         # prints 'Hi Mom!'
2192*0Sstevel@tonic-gate    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
2193*0Sstevel@tonic-gate                                           # prints '1'
2194*0Sstevel@tonic-gate    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
2195*0Sstevel@tonic-gate                                           # prints '1'
2196*0Sstevel@tonic-gate
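To see C<$^R> carry a value from one code expression to a later one,
consider this minimal sketch.  The variable C<$seen> and the value 42
are assumptions for illustration; a package variable is used so that
the code expression can reach it regardless of perl version.

```perl
use strict;
use warnings;

our $seen;   # package variable written by the second code expression

my $x = "abcdef";
# The first code expression returns 42, which lands in $^R;
# the second copies $^R out before the match finishes.
my $ok = $x =~ /abc(?{ 42 })def(?{ $seen = $^R })/;

print "matched, \$^R was $seen\n" if $ok;
```

Neither code expression is used as a conditional, so both always
succeed as assertions and only their results matter.
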
2197*0Sstevel@tonic-gateThe bit of magic mentioned in the section title occurs when the regexp
2198*0Sstevel@tonic-gatebacktracks in the process of searching for a match.  If the regexp
2199*0Sstevel@tonic-gatebacktracks over a code expression and if the variables used within are
2200*0Sstevel@tonic-gatelocalized using C<local>, the changes in the variables produced by the
2201*0Sstevel@tonic-gatecode expression are undone! Thus, if we wanted to count how many times
2202*0Sstevel@tonic-gatea character got matched inside a group, we could use, e.g.,
2203*0Sstevel@tonic-gate
2204*0Sstevel@tonic-gate    $x = "aaaa";
2205*0Sstevel@tonic-gate    $count = 0;  # initialize 'a' count
2206*0Sstevel@tonic-gate    $c = "bob";  # test if $c gets clobbered
2207*0Sstevel@tonic-gate    $x =~ /(?{local $c = 0;})         # initialize count
2208*0Sstevel@tonic-gate           ( a                        # match 'a'
2209*0Sstevel@tonic-gate             (?{local $c = $c + 1;})  # increment count
2210*0Sstevel@tonic-gate           )*                         # do this any number of times,
2211*0Sstevel@tonic-gate           aa                         # but match 'aa' at the end
2212*0Sstevel@tonic-gate           (?{$count = $c;})          # copy local $c var into $count
2213*0Sstevel@tonic-gate          /x;
2214*0Sstevel@tonic-gate    print "'a' count is $count, \$c variable is '$c'\n";
2215*0Sstevel@tonic-gate
2216*0Sstevel@tonic-gateThis prints
2217*0Sstevel@tonic-gate
2218*0Sstevel@tonic-gate    'a' count is 2, $c variable is 'bob'
2219*0Sstevel@tonic-gate
2220*0Sstevel@tonic-gateIf we replace the S<C< (?{local $c = $c + 1;})> > with
2221*0Sstevel@tonic-gateS<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
2222*0Sstevel@tonic-gateduring backtracking, and we get
2223*0Sstevel@tonic-gate
2224*0Sstevel@tonic-gate    'a' count is 4, $c variable is 'bob'
2225*0Sstevel@tonic-gate
2226*0Sstevel@tonic-gateNote that only localized variable changes are undone.  Other side
2227*0Sstevel@tonic-gateeffects of code expression execution are permanent.  Thus
2228*0Sstevel@tonic-gate
2229*0Sstevel@tonic-gate    $x = "aaaa";
2230*0Sstevel@tonic-gate    $x =~ /(a(?{print "Yow\n";}))*aa/;
2231*0Sstevel@tonic-gate
2232*0Sstevel@tonic-gateproduces
2233*0Sstevel@tonic-gate
    Yow
    Yow
    Yow
    Yow
2238*0Sstevel@tonic-gate
2239*0Sstevel@tonic-gateThe result C<$^R> is automatically localized, so that it will behave
2240*0Sstevel@tonic-gateproperly in the presence of backtracking.
2241*0Sstevel@tonic-gate
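Since C<$^R> is localized automatically, the earlier counting example
can be sketched without an explicit C<local>.  This is an illustrative
variant rather than a form the tutorial itself presents, and C<$count>
is declared as a package variable here by assumption.

```perl
use strict;
use warnings;

our $count = 0;

my $x = "aaaa";
$x =~ /(?{ 0 })                  # start the running total at 0
       (?: a (?{ $^R + 1 }) )*   # each 'a' bumps the total by one
       aa                        # but 'aa' must still match at the end
       (?{ $count = $^R })       # copy the surviving total out
      /x;

print "'a' count is $count\n";
```

As with the C<local> version, the increments made for the two C<a>'s
that are later given back to C<aa> are undone by backtracking, so the
expected count is 2.
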
2242*0Sstevel@tonic-gateThis example uses a code expression in a conditional to match the
2243*0Sstevel@tonic-gatearticle 'the' in either English or German:
2244*0Sstevel@tonic-gate
2245*0Sstevel@tonic-gate    $lang = 'DE';  # use German
2246*0Sstevel@tonic-gate    ...
2247*0Sstevel@tonic-gate    $text = "das";
2248*0Sstevel@tonic-gate    print "matched\n"
2249*0Sstevel@tonic-gate        if $text =~ /(?(?{
2250*0Sstevel@tonic-gate                          $lang eq 'EN'; # is the language English?
2251*0Sstevel@tonic-gate                         })
2252*0Sstevel@tonic-gate                       the |             # if so, then match 'the'
2253*0Sstevel@tonic-gate                       (die|das|der)     # else, match 'die|das|der'
2254*0Sstevel@tonic-gate                     )
2255*0Sstevel@tonic-gate                    /xi;
2256*0Sstevel@tonic-gate
2257*0Sstevel@tonic-gateNote that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
2258*0Sstevel@tonic-gateC<(?((?{...}))yes-regexp|no-regexp)>.  In other words, in the case of a
2259*0Sstevel@tonic-gatecode expression, we don't need the extra parentheses around the
2260*0Sstevel@tonic-gateconditional.
2261*0Sstevel@tonic-gate
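The conditional can be wrapped in a sub to try several language and
word combinations.  This is a sketch only: the sub name C<is_article>,
the anchors C<^> and C<$>, and the use of a package variable C<$lang>
set with C<local> are assumptions added for illustration.

```perl
use strict;
use warnings;

our $lang;   # package variable consulted by the code expression

# Hypothetical helper: is $text a definite article in language $code?
sub is_article {
    my ($code, $text) = @_;
    local $lang = $code;
    return $text =~ /^(?(?{ $lang eq 'EN' }) the | (?:die|das|der) )$/xi ? 1 : 0;
}

print is_article('DE', 'das'), "\n";   # 1
print is_article('EN', 'das'), "\n";   # 0
```

The code expression is evaluated each time the regexp runs, so the
same regexp text follows whichever language is current at that moment.
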
2262*0Sstevel@tonic-gateIf you try to use code expressions with interpolating variables, perl
2263*0Sstevel@tonic-gatemay surprise you:
2264*0Sstevel@tonic-gate
2265*0Sstevel@tonic-gate    $bar = 5;
2266*0Sstevel@tonic-gate    $pat = '(?{ 1 })';
2267*0Sstevel@tonic-gate    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
2268*0Sstevel@tonic-gate    /foo(?{ 1 })$bar/;   # compile error!
2269*0Sstevel@tonic-gate    /foo${pat}bar/;      # compile error!
2270*0Sstevel@tonic-gate
2271*0Sstevel@tonic-gate    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
2272*0Sstevel@tonic-gate    /foo${pat}bar/;      # compiles ok
2273*0Sstevel@tonic-gate
If a regexp has (1) code expressions and interpolating variables, or
2275*0Sstevel@tonic-gate(2) a variable that interpolates a code expression, perl treats the
2276*0Sstevel@tonic-gateregexp as an error. If the code expression is precompiled into a
2277*0Sstevel@tonic-gatevariable, however, interpolating is ok. The question is, why is this
2278*0Sstevel@tonic-gatean error?
2279*0Sstevel@tonic-gate
2280*0Sstevel@tonic-gateThe reason is that variable interpolation and code expressions
together pose a security risk.  The combination is dangerous because
programmers who write search engines often take user input and
2283*0Sstevel@tonic-gateplug it directly into a regexp:
2284*0Sstevel@tonic-gate
    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp
2288*0Sstevel@tonic-gate
2289*0Sstevel@tonic-gateIf the C<$regexp> variable contains a code expression, the user could
2290*0Sstevel@tonic-gatethen execute arbitrary Perl code.  For instance, some joker could
2291*0Sstevel@tonic-gatesearch for S<C<system('rm -rf *');> > to erase your files.  In this
2292*0Sstevel@tonic-gatesense, the combination of interpolation and code expressions B<taints>
2293*0Sstevel@tonic-gateyour regexp.  So by default, using both interpolation and code
2294*0Sstevel@tonic-gateexpressions in the same regexp is not allowed.  If you're not
2295*0Sstevel@tonic-gateconcerned about malicious users, it is possible to bypass this
2296*0Sstevel@tonic-gatesecurity check by invoking S<C<use re 'eval'> >:
2297*0Sstevel@tonic-gate
2298*0Sstevel@tonic-gate    use re 'eval';       # throw caution out the door
2299*0Sstevel@tonic-gate    $bar = 5;
2300*0Sstevel@tonic-gate    $pat = '(?{ 1 })';
2301*0Sstevel@tonic-gate    /foo(?{ 1 })$bar/;   # compiles ok
2302*0Sstevel@tonic-gate    /foo${pat}bar/;      # compiles ok
2303*0Sstevel@tonic-gate
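The effect of the pragma on a code expression held in a plain string
can be sketched as follows.  The names C<$hits>, C<$refused> and
C<$allowed> are assumptions for illustration, and the counter is
addressed as C<$main::hits> so that the interpolated code does not
depend on any declaration being in scope.

```perl
use strict;
use warnings;

our $hits = 0;
my $pat = '(?{ $main::hits++ })';   # a code expression held in a plain string

# Without the pragma, interpolating $pat is refused; eval {} traps the error.
my $refused = !eval { "foobar" =~ /foo${pat}bar/; 1 };

my $allowed;
{
    use re 'eval';                  # in effect only inside this block
    $allowed = "foobar" =~ /foo${pat}bar/ ? 1 : 0;
}

print "refused without pragma: ", ($refused ? "yes" : "no"), "\n";
print "allowed with pragma: $allowed, hits=$hits\n";
```

Keeping C<use re 'eval'> inside the smallest possible block limits the
amount of code in which interpolated strings are trusted to run Perl.
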
2304*0Sstevel@tonic-gateAnother form of code expression is the S<B<pattern code expression> >.
2305*0Sstevel@tonic-gateThe pattern code expression is like a regular code expression, except
2306*0Sstevel@tonic-gatethat the result of the code evaluation is treated as a regular
2307*0Sstevel@tonic-gateexpression and matched immediately.  A simple example is
2308*0Sstevel@tonic-gate
2309*0Sstevel@tonic-gate    $length = 5;
2310*0Sstevel@tonic-gate    $char = 'a';
2311*0Sstevel@tonic-gate    $x = 'aaaaabb';
2312*0Sstevel@tonic-gate    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
2313*0Sstevel@tonic-gate
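A pattern code expression can also build its regexp from what has
already been captured.  The sketch below matches a count followed by
exactly that many C<x> characters; the format and the names C<@tests>
and C<@ok> are assumptions invented for illustration.

```perl
use strict;
use warnings;

# A count followed by exactly that many 'x' characters, e.g. "3xxx".
# The inner regexp is built at match time from what group 1 captured.
my $re = qr/^(\d+)(??{ 'x{' . $1 . '}' })$/;

my @tests = ("3xxx", "3xx", "0", "12" . "x" x 12);
my @ok    = map { $_ =~ $re ? 1 : 0 } @tests;

print "@ok\n";   # 1 0 1 1
```

An ordinary quantifier cannot do this, since its bounds must be known
when the regexp is compiled rather than when it is run.
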
2314*0Sstevel@tonic-gate
2315*0Sstevel@tonic-gateThis final example contains both ordinary and pattern code
2316*0Sstevel@tonic-gateexpressions.   It detects if a binary string C<1101010010001...> has a
2317*0Sstevel@tonic-gateFibonacci spacing 0,1,1,2,3,5,...  of the C<1>'s:
2318*0Sstevel@tonic-gate
2319*0Sstevel@tonic-gate    $s0 = 0; $s1 = 1; # initial conditions
2320*0Sstevel@tonic-gate    $x = "1101010010001000001";
2321*0Sstevel@tonic-gate    print "It is a Fibonacci sequence\n"
2322*0Sstevel@tonic-gate        if $x =~ /^1         # match an initial '1'
2323*0Sstevel@tonic-gate                    (
2324*0Sstevel@tonic-gate                       (??{'0' x $s0}) # match $s0 of '0'
2325*0Sstevel@tonic-gate                       1               # and then a '1'
2326*0Sstevel@tonic-gate                       (?{
2327*0Sstevel@tonic-gate                          $largest = $s0;   # largest seq so far
2328*0Sstevel@tonic-gate                          $s2 = $s1 + $s0;  # compute next term
2329*0Sstevel@tonic-gate                          $s0 = $s1;        # in Fibonacci sequence
2330*0Sstevel@tonic-gate                          $s1 = $s2;
2331*0Sstevel@tonic-gate                         })
2332*0Sstevel@tonic-gate                    )+   # repeat as needed
2333*0Sstevel@tonic-gate                  $      # that is all there is
2334*0Sstevel@tonic-gate                 /x;
2335*0Sstevel@tonic-gate    print "Largest sequence matched was $largest\n";
2336*0Sstevel@tonic-gate
2337*0Sstevel@tonic-gateThis prints
2338*0Sstevel@tonic-gate
2339*0Sstevel@tonic-gate    It is a Fibonacci sequence
2340*0Sstevel@tonic-gate    Largest sequence matched was 5
2341*0Sstevel@tonic-gate
2342*0Sstevel@tonic-gateHa! Try that with your garden variety regexp package...
2343*0Sstevel@tonic-gate
2344*0Sstevel@tonic-gateNote that the variables C<$s0> and C<$s1> are not substituted when the
2345*0Sstevel@tonic-gateregexp is compiled, as happens for ordinary variables outside a code
2346*0Sstevel@tonic-gateexpression.  Rather, the code expressions are evaluated when perl
2347*0Sstevel@tonic-gateencounters them during the search for a match.
2348*0Sstevel@tonic-gate
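The match-time evaluation can be checked with a precompiled regexp
whose code expression reads a variable that changes between matches.
This is a minimal sketch; C<$n> and the three result variables are
assumptions for illustration.

```perl
use strict;
use warnings;

our $n;   # consulted each time the regexp runs, not when it is compiled

my $re = qr/^(??{ 'a{' . $n . '}' })$/;

$n = 2;
my $two_with_n2   = "aa"  =~ $re ? 1 : 0;   # 1
$n = 3;
my $two_with_n3   = "aa"  =~ $re ? 1 : 0;   # 0
my $three_with_n3 = "aaa" =~ $re ? 1 : 0;   # 1

print "$two_with_n2 $two_with_n3 $three_with_n3\n";
```

Had C<$n> been interpolated outside a code expression, its value at
the time C<qr//> was evaluated would have been frozen into the regexp.
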
2349*0Sstevel@tonic-gateThe regexp without the C<//x> modifier is
2350*0Sstevel@tonic-gate
    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
2352*0Sstevel@tonic-gate
2353*0Sstevel@tonic-gateand is a great start on an Obfuscated Perl entry :-) When working with
2354*0Sstevel@tonic-gatecode and conditional expressions, the extended form of regexps is
2355*0Sstevel@tonic-gatealmost necessary in creating and debugging regexps.
2356*0Sstevel@tonic-gate
2357*0Sstevel@tonic-gate=head2 Pragmas and debugging
2358*0Sstevel@tonic-gate
2359*0Sstevel@tonic-gateSpeaking of debugging, there are several pragmas available to control
2360*0Sstevel@tonic-gateand debug regexps in Perl.  We have already encountered one pragma in
2361*0Sstevel@tonic-gatethe previous section, S<C<use re 'eval';> >, that allows variable
2362*0Sstevel@tonic-gateinterpolation and code expressions to coexist in a regexp.  The other
2363*0Sstevel@tonic-gatepragmas are
2364*0Sstevel@tonic-gate
2365*0Sstevel@tonic-gate    use re 'taint';
2366*0Sstevel@tonic-gate    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
2368*0Sstevel@tonic-gate
2369*0Sstevel@tonic-gateThe C<taint> pragma causes any substrings from a match with a tainted
2370*0Sstevel@tonic-gatevariable to be tainted as well.  This is not normally the case, as
2371*0Sstevel@tonic-gateregexps are often used to extract the safe bits from a tainted
2372*0Sstevel@tonic-gatevariable.  Use C<taint> when you are not extracting safe bits, but are
2373*0Sstevel@tonic-gateperforming some other processing.  Both C<taint> and C<eval> pragmas
2374*0Sstevel@tonic-gateare lexically scoped, which means they are in effect only until
2375*0Sstevel@tonic-gatethe end of the block enclosing the pragmas.
2376*0Sstevel@tonic-gate
2377*0Sstevel@tonic-gate    use re 'debug';
2378*0Sstevel@tonic-gate    /^(.*)$/s;       # output debugging info
2379*0Sstevel@tonic-gate
2380*0Sstevel@tonic-gate    use re 'debugcolor';
2381*0Sstevel@tonic-gate    /^(.*)$/s;       # output debugging info in living color
2382*0Sstevel@tonic-gate
2383*0Sstevel@tonic-gateThe global C<debug> and C<debugcolor> pragmas allow one to get
2384*0Sstevel@tonic-gatedetailed debugging info about regexp compilation and
execution.  C<debugcolor> is the same as C<debug>, except the debugging
2386*0Sstevel@tonic-gateinformation is displayed in color on terminals that can display
2387*0Sstevel@tonic-gatetermcap color sequences.  Here is example output:
2388*0Sstevel@tonic-gate
2389*0Sstevel@tonic-gate    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
2390*0Sstevel@tonic-gate    Compiling REx `a*b+c'
2391*0Sstevel@tonic-gate    size 9 first at 1
2392*0Sstevel@tonic-gate       1: STAR(4)
2393*0Sstevel@tonic-gate       2:   EXACT <a>(0)
2394*0Sstevel@tonic-gate       4: PLUS(7)
2395*0Sstevel@tonic-gate       5:   EXACT <b>(0)
2396*0Sstevel@tonic-gate       7: EXACT <c>(9)
2397*0Sstevel@tonic-gate       9: END(0)
2398*0Sstevel@tonic-gate    floating `bc' at 0..2147483647 (checking floating) minlen 2
2399*0Sstevel@tonic-gate    Guessing start of match, REx `a*b+c' against `abc'...
2400*0Sstevel@tonic-gate    Found floating substr `bc' at offset 1...
2401*0Sstevel@tonic-gate    Guessed: match at offset 0
2402*0Sstevel@tonic-gate    Matching REx `a*b+c' against `abc'
2403*0Sstevel@tonic-gate      Setting an EVAL scope, savestack=3
2404*0Sstevel@tonic-gate       0 <> <abc>             |  1:  STAR
2405*0Sstevel@tonic-gate                               EXACT <a> can match 1 times out of 32767...
2406*0Sstevel@tonic-gate      Setting an EVAL scope, savestack=3
2407*0Sstevel@tonic-gate       1 <a> <bc>             |  4:    PLUS
2408*0Sstevel@tonic-gate                               EXACT <b> can match 1 times out of 32767...
2409*0Sstevel@tonic-gate      Setting an EVAL scope, savestack=3
2410*0Sstevel@tonic-gate       2 <ab> <c>             |  7:      EXACT <c>
2411*0Sstevel@tonic-gate       3 <abc> <>             |  9:      END
2412*0Sstevel@tonic-gate    Match successful!
2413*0Sstevel@tonic-gate    Freeing REx: `a*b+c'
2414*0Sstevel@tonic-gate
2415*0Sstevel@tonic-gateIf you have gotten this far into the tutorial, you can probably guess
2416*0Sstevel@tonic-gatewhat the different parts of the debugging output tell you.  The first
2417*0Sstevel@tonic-gatepart
2418*0Sstevel@tonic-gate
2419*0Sstevel@tonic-gate    Compiling REx `a*b+c'
2420*0Sstevel@tonic-gate    size 9 first at 1
2421*0Sstevel@tonic-gate       1: STAR(4)
2422*0Sstevel@tonic-gate       2:   EXACT <a>(0)
2423*0Sstevel@tonic-gate       4: PLUS(7)
2424*0Sstevel@tonic-gate       5:   EXACT <b>(0)
2425*0Sstevel@tonic-gate       7: EXACT <c>(9)
2426*0Sstevel@tonic-gate       9: END(0)
2427*0Sstevel@tonic-gate
2428*0Sstevel@tonic-gatedescribes the compilation stage.  C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, go to line 4,
2430*0Sstevel@tonic-gatei.e., C<PLUS(7)>.  The middle lines describe some heuristics and
2431*0Sstevel@tonic-gateoptimizations performed before a match:
2432*0Sstevel@tonic-gate
2433*0Sstevel@tonic-gate    floating `bc' at 0..2147483647 (checking floating) minlen 2
2434*0Sstevel@tonic-gate    Guessing start of match, REx `a*b+c' against `abc'...
2435*0Sstevel@tonic-gate    Found floating substr `bc' at offset 1...
2436*0Sstevel@tonic-gate    Guessed: match at offset 0
2437*0Sstevel@tonic-gate
2438*0Sstevel@tonic-gateThen the match is executed and the remaining lines describe the
2439*0Sstevel@tonic-gateprocess:
2440*0Sstevel@tonic-gate
2441*0Sstevel@tonic-gate    Matching REx `a*b+c' against `abc'
2442*0Sstevel@tonic-gate      Setting an EVAL scope, savestack=3
2443*0Sstevel@tonic-gate       0 <> <abc>             |  1:  STAR
2444*0Sstevel@tonic-gate                               EXACT <a> can match 1 times out of 32767...
2445*0Sstevel@tonic-gate      Setting an EVAL scope, savestack=3
2446*0Sstevel@tonic-gate       1 <a> <bc>             |  4:    PLUS
2447*0Sstevel@tonic-gate                               EXACT <b> can match 1 times out of 32767...
2448*0Sstevel@tonic-gate      Setting an EVAL scope, savestack=3
2449*0Sstevel@tonic-gate       2 <ab> <c>             |  7:      EXACT <c>
2450*0Sstevel@tonic-gate       3 <abc> <>             |  9:      END
2451*0Sstevel@tonic-gate    Match successful!
2452*0Sstevel@tonic-gate    Freeing REx: `a*b+c'
2453*0Sstevel@tonic-gate
2454*0Sstevel@tonic-gateEach step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
2455*0Sstevel@tonic-gatepart of the string matched and C<< <y> >> the part not yet
2456*0Sstevel@tonic-gatematched.  The S<C<< |  1:  STAR >> > says that perl is at line number 1
in the compilation list above.  See
2458*0Sstevel@tonic-gateL<perldebguts/"Debugging regular expressions"> for much more detail.
2459*0Sstevel@tonic-gate
2460*0Sstevel@tonic-gateAn alternative method of debugging regexps is to embed C<print>
2461*0Sstevel@tonic-gatestatements within the regexp.  This provides a blow-by-blow account of
2462*0Sstevel@tonic-gatethe backtracking in an alternation:
2463*0Sstevel@tonic-gate
2464*0Sstevel@tonic-gate    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
2465*0Sstevel@tonic-gate                     t(?{print "t1\n";})
2466*0Sstevel@tonic-gate                     h(?{print "h1\n";})
2467*0Sstevel@tonic-gate                     i(?{print "i1\n";})
2468*0Sstevel@tonic-gate                     s(?{print "s1\n";})
2469*0Sstevel@tonic-gate                         |
2470*0Sstevel@tonic-gate                     t(?{print "t2\n";})
2471*0Sstevel@tonic-gate                     h(?{print "h2\n";})
2472*0Sstevel@tonic-gate                     a(?{print "a2\n";})
2473*0Sstevel@tonic-gate                     t(?{print "t2\n";})
2474*0Sstevel@tonic-gate                     (?{print "Done at position ", pos, "\n";})
2475*0Sstevel@tonic-gate                    @x;
2476*0Sstevel@tonic-gate
2477*0Sstevel@tonic-gateprints
2478*0Sstevel@tonic-gate
2479*0Sstevel@tonic-gate    Start at position 0
2480*0Sstevel@tonic-gate    t1
2481*0Sstevel@tonic-gate    h1
2482*0Sstevel@tonic-gate    t2
2483*0Sstevel@tonic-gate    h2
2484*0Sstevel@tonic-gate    a2
2485*0Sstevel@tonic-gate    t2
2486*0Sstevel@tonic-gate    Done at position 4
2487*0Sstevel@tonic-gate
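When printed output is inconvenient, for instance inside a test, the
same idea can be sketched by pushing trace messages onto an array.
The array C<@trace>, the messages and the tiny two-branch regexp are
assumptions for illustration.

```perl
use strict;
use warnings;

our @trace;   # package variable filled in by the code expressions

my $matched = "ab" =~ m{
      a (?{ push @trace, 'first branch saw a'  }) c
    |
      a (?{ push @trace, 'second branch saw a' }) b
}x ? 1 : 0;

print "$_\n" for @trace;
print "matched: $matched\n";
```

The last entry records the branch that finally succeeded.  Whether
entries from abandoned branches appear as well depends on which
optimizations the engine applies, so a trace of this kind is best read
as a record of what ran rather than of what might have run.
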
2488*0Sstevel@tonic-gate=head1 BUGS
2489*0Sstevel@tonic-gate
2490*0Sstevel@tonic-gateCode expressions, conditional expressions, and independent expressions
2491*0Sstevel@tonic-gateare B<experimental>.  Don't use them in production code.  Yet.
2492*0Sstevel@tonic-gate
2493*0Sstevel@tonic-gate=head1 SEE ALSO
2494*0Sstevel@tonic-gate
2495*0Sstevel@tonic-gateThis is just a tutorial.  For the full story on perl regular
2496*0Sstevel@tonic-gateexpressions, see the L<perlre> regular expressions reference page.
2497*0Sstevel@tonic-gate
2498*0Sstevel@tonic-gateFor more information on the matching C<m//> and substitution C<s///>
2499*0Sstevel@tonic-gateoperators, see L<perlop/"Regexp Quote-Like Operators">.  For
2500*0Sstevel@tonic-gateinformation on the C<split> operation, see L<perlfunc/split>.
2501*0Sstevel@tonic-gate
2502*0Sstevel@tonic-gateFor an excellent all-around resource on the care and feeding of
2503*0Sstevel@tonic-gateregular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).
2505*0Sstevel@tonic-gate
2506*0Sstevel@tonic-gate=head1 AUTHOR AND COPYRIGHT
2507*0Sstevel@tonic-gate
2508*0Sstevel@tonic-gateCopyright (c) 2000 Mark Kvale
2509*0Sstevel@tonic-gateAll rights reserved.
2510*0Sstevel@tonic-gate
2511*0Sstevel@tonic-gateThis document may be distributed under the same terms as Perl itself.
2512*0Sstevel@tonic-gate
2513*0Sstevel@tonic-gate=head2 Acknowledgments
2514*0Sstevel@tonic-gate
2515*0Sstevel@tonic-gateThe inspiration for the stop codon DNA example came from the ZIP
2516*0Sstevel@tonic-gatecode example in chapter 7 of I<Mastering Regular Expressions>.
2517*0Sstevel@tonic-gate
2518*0Sstevel@tonic-gateThe author would like to thank Jeff Pinyan, Andrew Johnson, Peter
2519*0Sstevel@tonic-gateHaworth, Ronald J Kimball, and Joe Smith for all their helpful
2520*0Sstevel@tonic-gatecomments.
2521*0Sstevel@tonic-gate
2522*0Sstevel@tonic-gate=cut
2523*0Sstevel@tonic-gate