=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);        # matches
    "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

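Because a variable is substituted into the regex before matching, any
metacharacters in its value keep their special meaning.  The sketch
below is an illustration only and goes slightly beyond this guide: it
assumes a hypothetical C<$sum> variable and uses Perl's built-in
C<quotemeta> function to backslash the C<+> so that it matches
literally.

```perl
use strict;
use warnings;

my $sum = "2+2";   # hypothetical search text containing the metacharacter '+'

# Interpolated as is, '+' acts as a quantifier, so the regex means
# "one or more 2s followed by a 2" and cannot match "2+2=4".
my $raw_hit = ("2+2=4" =~ /$sum/) ? 1 : 0;

# quotemeta puts a backslash before every non-word character.
my $escaped    = quotemeta $sum;
my $quoted_hit = ("2+2=4" =~ /$escaped/) ? 1 : 0;

print "escaped pattern: $escaped\n";
print "raw: $raw_hit, quoted: $quoted_hit\n";
```

The escaped copy behaves like the hand-written C</2\+2/> shown earlier.
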
With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

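Using both anchors together is a common way to check that a whole
string, rather than some part of it, has a given form.  The following
is a minimal sketch; the helper name C<is_yes_or_no> and the sample
inputs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $input is "yes" or "no",
# optionally followed by one trailing newline (which $ tolerates).
sub is_yes_or_no {
    my ($input) = @_;
    return $input =~ /^yes$/ || $input =~ /^no$/ ? 1 : 0;
}

for my $input ("yes", "yes\n", "yes!", "oh no") {
    (my $shown = $input) =~ s/\n/\\n/;   # make the newline visible
    print "'$shown' => ", is_yes_or_no($input), "\n";
}
```

Without the anchors, C<"yes!"> and C<"oh no"> would both be accepted,
because an unanchored regex may match anywhere in the string.
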
=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class.  The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;  # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

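Ranges and a literal C<'-'> can be combined with the anchors from the
previous section to validate short fields.  The two helpers below are
hypothetical names chosen for this sketch, not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text is two hexadecimal digits
# (a single trailing newline is tolerated because of $).
sub is_hex_byte {
    my ($text) = @_;
    return $text =~ /^[0-9a-fA-F][0-9a-fA-F]$/ ? 1 : 0;
}

# Hypothetical helper: true if $text is a sign followed by one digit.
# The '-' comes first in the class, so it is an ordinary character.
sub is_signed_digit {
    my ($text) = @_;
    return $text =~ /^[-+][0-9]$/ ? 1 : 0;
}

print "7f:  ", is_hex_byte("7f"), "\n";
print "7g:  ", is_hex_byte("7g"), "\n";
print "fff: ", is_hex_byte("fff"), "\n";
print "-5:  ", is_signed_digit("-5"), "\n";
print "x5:  ", is_signed_digit("x5"), "\n";
```
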
The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

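To see which occurrence of C<cat> each word-anchored regex picks, one
can print the offset at which the match began.  This sketch relies on
the special array C<@->, which this guide does not otherwise cover;
C<$-[0]> holds the start offset of the last successful match.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# $-[0] is the offset at which the last successful match started.
my ($at_start, $at_end, $whole);
$at_start = $-[0] if $x =~ /\bcat/;    # 'cat' opening 'catenates'
$at_end   = $-[0] if $x =~ /cat\b/;    # 'cat' closing 'Housecat'
$whole    = $-[0] if $x =~ /\bcat\b/;  # the stand-alone 'cat'

printf "%-8s starts at offset %d\n", '\bcat',   $at_start;
printf "%-8s starts at offset %d\n", 'cat\b',   $at_end;
printf "%-8s starts at offset %d\n", '\bcat\b', $whole;
```

The three offsets differ, which confirms that the same three letters
are matched in three different words.
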
=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, perl will try to match the regex at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.

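The two rules, earliest position first and then first alternative that
succeeds, can be checked by printing the text that actually matched.
This sketch uses the special variable C<$&> (the text of the last
successful match), which is an addition to what this guide covers.

```perl
use strict;
use warnings;

# Collect the matched text of each alternation.
my @found;
push @found, $& if "cats and dogs" =~ /dog|cat|bird/;  # earliest position wins
push @found, $& if "cats" =~ /c|ca|cat|cats/;          # first alternative wins
push @found, $& if "cats" =~ /cats|cat|ca|c/;          # first alternative wins

print join(", ", @found), "\n";
```

Ordering alternatives from longest to shortest, as in the third regex,
is one way to make the longest candidate the one that matches.
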
=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...  Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>,
C<\2>, ... only inside a regex.

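The following sketch puts list-context extraction and a backreference
side by side.  The sample strings are invented for illustration.

```perl
use strict;
use warnings;

my $time = "Backup finished at 07:45:09";   # hypothetical sample input

# In list context the match returns the three captured fields.
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
print "h=$hours m=$minutes s=$seconds\n";

# The backreference \1 must repeat exactly what group 1 captured.
my $text    = "Paris in the the spring";    # hypothetical sample input
my $doubled = $text =~ /(\w\w\w)\s\1/ ? $1 : "";
print "doubled word: $doubled\n";
```

If the first match had failed, the list assignment would have left all
three variables undefined, so real code would normally test the result
of the match before using them.
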
=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match C<n> or more times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

The anchors matter in the last two examples: without them, the regex
could match just part of a longer run of digits.

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

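The phrase "while still allowing the regex to match" is the important
part.  In the sketch below, the second regex changes the last group
from C<.*> to C<.+>, so the first C<.*> has to give characters back
until something is left over for the final group.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# Last group may be empty: the first .* keeps nearly everything.
my @greedy = $x =~ /^(.*)(at)(.*)$/;

# Last group needs at least one character: the first .* must back
# off to the 'at' inside 'cat'.
my @forced = $x =~ /^(.*)(at)(.+)$/;

print join('|', @greedy), "\n";
print join('|', @forced), "\n";
```

The first line printed ends with an empty third field; in the second,
the third field holds everything after C<cat>.
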
=head2 More matching

There are a few more things you might want to know about matching
operators.  In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop.  If
C<$pattern> won't be changing, use the C<//o> modifier to perform
variable substitutions only once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    @pattern = ('Seuss');
    m/@pattern/; # matches 'Seuss'
    m'@pattern'; # matches the literal string '@pattern'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.

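The difference between C<//g> and C<//gc> after a failed match can be
observed directly with C<pos()>.  This is a small sketch using the
same sample string as above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Record pos() after each successful scalar-context //g match.
my @ends;
while ($x =~ /\w+/g) {
    push @ends, pos($x);
}

# The loop ended on a failed match, which reset the position.
my $after_loop = defined pos($x) ? pos($x) : 'undef';

# With //gc a failed match leaves the position where it was.
my $found_cat  = $x =~ /cat/gc  ? 1 : 0;   # succeeds
my $found_bird = $x =~ /bird/gc ? 1 : 0;   # fails
my $after_gc   = pos($x);

print "ends: @ends\n";
print "after loop: $after_loop, after failed //gc match: $after_gc\n";
```

Keeping the position after a failure is what makes it practical to try
several different regexes in turn at the same point in a string.
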
In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

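When the regex has more than one grouping, list-context C<//g> returns
the groupings of every match in order, which fits naturally into a
hash.  The input string and variable names here are hypothetical.

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";   # hypothetical input

# Two groupings per match: the flat list is (name, value, name, value, ...).
my %setting = $config =~ /(\w+)=(\d+)/g;

# No groupings: the list holds each whole match.
my @numbers = $config =~ /\d+/g;

print "$_ => $setting{$_}\n" for sort keys %setting;
print "numbers: @numbers\n";
```
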
=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used,
as in C<s'''>, then the regex and replacement are treated as single
quoted strings.

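The return value of C<s///> and the C<//e> modifier can be combined in
one short script.  The sketch below uses an invented sample string; the
C<(my $copy = $original) =~ s/.../.../> form is a common idiom for
substituting in a copy while leaving the original untouched.

```perl
use strict;
use warnings;

my $line = "3 apples and 12 pears";   # hypothetical sample input

# Copy $line, then double every number in the copy with s///ge.
(my $doubled = $line) =~ s/(\d+)/$1 * 2/ge;

# s///g returns the number of substitutions it made.
my $count = ($line =~ s/\d+/N/g);

# No match: s/// returns false and leaves the string unchanged.
my $none = ($line =~ s/banana/kiwi/) ? 1 : 0;

print "$doubled\n";
print "$line ($count substitutions, banana replaced: $none)\n";
```
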
=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex describes the separator at which
C<string> is split.  For example, to split a string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.  If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.

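The three forms of C<split> described above can be compared in one
place.  The record and expression strings in this sketch are made up
for illustration.

```perl
use strict;
use warnings;

my $record = "alice:x:1001:/home/alice";   # hypothetical colon-separated record

# Plain separator: the colons are discarded.
my @field = split /:/, $record;

# Empty regex: one element per character.
my @char = split //, "perl";

# Grouping in the separator: the matched signs are kept in the list.
my @token = split /([+-])/, "10+20-5";

print scalar(@field), " fields, home is $field[3]\n";
print "chars: @char\n";
print "tokens: @token\n";
```

Keeping the separators, as in the last call, is useful when they carry
meaning of their own, such as the arithmetic signs here.
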
=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut