=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

Matching operations can have various modifiers.  Modifiers
that relate to the interpretation of the regular expression inside
are listed below.  Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale.  See L<perllocale>.

=item m

Treat string as multiple lines.  That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s

Treat string as single line.  That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting.  That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x

Extend your pattern's legibility by permitting whitespace and comments.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash.  Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct.  See below.

The C</x> modifier itself needs a little more explanation.  It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.  This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes.  Taken together,
these features go a long way towards making Perl's regular expressions
more readable.  Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early.  See the C-comment deletion code
in L<perlop>.
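As a minimal sketch of the rules above, the following pattern is laid out
with C</x>.  The input string and the variable names are hypothetical;
the point is only that the literal space and the literal C<#> must be
backslashed, while everything after an unescaped C<#> is a comment.

```perl
use strict;
use warnings;

# Hypothetical input, chosen only for illustration.
my $line = 'issue #42 fixed';

my ($word, $num) = $line =~ m{
    (\w+)      # a word
    \          # a literal space, backslashed so that it survives
    \#         # a literal hash sign, escaped so it starts no comment
    (\d+)      # the issue number
}x;
```

Braces are used as the delimiter here so that a slash in a comment could
not end the pattern early; the comments must then avoid unbalanced braces
instead.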

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines.  (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.)  See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line.  Embedded newlines
will not be matched by "^" or "$".  You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline.  At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator.  (Older programs did this by setting C<$*>,
but this practice is now deprecated.)

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.  The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.
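The effect of C</m> and C</s> described above can be sketched as follows.
The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after an embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;
my ($with_s) = $text =~ /(first.*)/s;
```

Here C<@plain> holds one word and C<@multi> holds two, while C<$with_s>
extends over both lines and C<$no_s> stops before the first newline.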

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.  In particular, the lower bound
is not optional.)  The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms.  The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match.  If you want it to match the
minimum number of times possible, follow the quantifier with a "?".  Note
that the meanings don't change, just the "greediness":

    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
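A small sketch of the difference in "greediness", using a made-up
string: both patterns start matching at the first C<< < >>, but they stop
at different places.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';

# Greedy: ".+" takes as much as it can and still lets ">" match,
# so the capture runs up to the last ">" in the string.
my ($greedy) = $html =~ /<(.+)>/;

# Minimal: ".+?" takes as little as it can, stopping at the first ">".
my ($minimal) = $html =~ /<(.+?)>/;
```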

Because patterns are processed as double quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale.  See L<perllocale>.  For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
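The following sketch illustrates the workaround, with a hypothetical
address as test data.  Interpolating a variable inside C<\Q...\E> is
fine, because its contents are quoted; only a I<literal> C<@> has to be
written outside the quoted region.

```perl
use strict;
use warnings;

my $address = 'fred@example.com';        # hypothetical data
my $text    = 'mail fred@example.com now';

# The variable is interpolated, then its contents are quoted.
my $by_var = $text =~ /\Q$address\E/ ? 1 : 0;

# Written literally: end \Q before the "@", escape it, then resume.
my $by_hand = $text =~ /\Qfred\E\@\Qexample.com\E/ ? 1 : 0;

# Why quoting matters: an unquoted "." matches any character.
my $loose  = 'fred@exampleXcom' =~ /fred\@example.com/      ? 1 : 0;
my $strict = 'fred@exampleXcom' =~ /fred\@\Qexample.com\E/ ? 1 : 0;
```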

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-"word" character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property.  Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to (?:\PM\pM*)
    \C  Match a single C char (octet) even under Unicode.
        NOTE: breaks up characters into their UTF-8 bytes,
        so you may end up with malformed pieces of UTF-8.
        Unsupported in lookbehind.

A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
to match a string of Perl-identifier characters (which isn't the same
as matching an English word).  If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, the result is not a range: the "-" is understood
literally.  If Unicode is in effect, C<\s> also matches "\x{85}",
"\x{2028}", and "\x{2029}".  See L<perlunicode> for more details about
C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
You can define your own C<\p> and C<\P> properties; see L<perlunicode>.

The POSIX character class syntax

    [:class:]

is also available.  The available classes and their backslash
equivalents (if available) are as follows:

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

=over

=item [1]

A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.

=item [2]

Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
also the (very rare) `vertical tabulator', "\ck", chr(11).

=item [3]

A Perl extension, see above.

=back

For example, use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class.  For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.
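To make the bracket nesting concrete, here is a small sketch using
ASCII-only sample strings (an assumption made so that locale and Unicode
settings do not affect the outcome).  The C<[:class:]> construct appears
inside an enclosing character class in both patterns.

```perl
use strict;
use warnings;

# Every character of this sample belongs to the class [01[:alpha:]%].
my @hits = 'a1%Z0' =~ /[01[:alpha:]%]/g;

# "2" and "-" are in neither the literal list nor [:alpha:].
my @misses = '2-' =~ /[01[:alpha:]%]/g;

# Count the uppercase letters: the outer [] is the character class,
# the inner [:upper:] is the POSIX construct.
my $upper_count = () = 'Hello World' =~ /[[:upper:]]/g;
```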

The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available), will hold:

    [:...:]     \p{...}        backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit        \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl    \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example C<[:lower:]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
`word' and `blank').

The classes whose names may not be obvious are:

=over 4

=item cntrl

Any control character.  Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters.  All characters with ord() less than
32 are most often classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph

Any alphanumeric or punctuation (special) character.

=item print

Any alphanumeric or punctuation (special) character or the space character.

=item punct

Any punctuation (special) character.

=item xdigit

Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension.  For example:

    POSIX         traditional   Unicode

    [:^digit:]    \D            \P{IsDigit}
    [:^space:]    \S            \P{IsSpace}
    [:^word:]     \W            \P{IsWord}

Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class.  The POSIX character classes
[.cc.] and [=cc=] are recognized but B<not> supported and trying to
use them will cause an error.

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>.  (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary.  To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
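The difference between C<\Z> and C<\z>, and the behaviour of C<\b>, can
be sketched with a couple of made-up strings:

```perl
use strict;
use warnings;

my $line = "foo bar\n";

# \Z tolerates a single trailing newline; \z does not.
my $cap_Z   = $line =~ /bar\Z/   ? 1 : 0;
my $small_z = $line =~ /bar\z/   ? 1 : 0;
my $with_nl = $line =~ /bar\n\z/ ? 1 : 0;

# \b keeps "cat" from matching inside "concat"; the start of the
# string and the "." both count as boundaries.
my @cats = 'cat concat cat.' =~ /\bcat\b/g;
```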

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against successive substrings
of your string; see the previous reference.  The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Currently C<\G> is only fully
supported when anchored to the start of the pattern; while it
is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
such uses (C</.\G/g>, for example) currently cause problems, and
it is recommended that you avoid such usage for now.
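A C<lex>-like scanner of the kind mentioned above might be sketched as
follows.  The token names and the tiny grammar are hypothetical.  Each
pattern is anchored with C<\G> at the start, and the C</c> modifier
(see L<perlop>) keeps C<pos()> in place when one alternative fails so
that the next one can be tried at the same position.

```perl
use strict;
use warnings;

my $input = 'x = 42 + y';
my @tokens;

while (1) {
    if    ($input =~ /\G\s+/gc)             { next }   # skip whitespace
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "IDENT:$1" }
    elsif ($input =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([=+])/gc)          { push @tokens, "OP:$1" }
    else                                    { last }   # end of input or unknown character
}
```

A real scanner would distinguish reaching the end of the input from
meeting an unrecognized character; this sketch simply stops in both
cases.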

The bracketing construct C<( ... )> creates capture buffers.  To
refer to the digit'th buffer use \<digit> within the
match.  Outside the match use "$" instead of "\".  (The
\<digit> notation works in certain circumstances outside
the match.  See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.

There is no limit to the number of captured substrings that you may
use.  However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc.  (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
a horizontal tab under ASCII.)  Perl resolves this
ambiguity by interpreting \10 as a backreference only if at least 10
left parentheses have opened before it.  Likewise \11 is a
backreference only if at least 11 left parentheses have opened
before it.  And so on.  \1 through \9 are always interpreted as
backreferences.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/(.)\1/) {                  # find first doubled char
        print "'$1' is the first doubled character\n";
    }

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous
match.  C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string.  (At one point C<$0> did
also, but now it returns the name of the program.)  C<$`> returns
everything before the matched string.  C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
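As an illustration only (the string and variable names are made up), the
following copies each of these variables right after a successful match.
Note that merely mentioning C<$&>, C<$`> or C<$'> in a program has the
performance cost described in the warning further on.

```perl
use strict;
use warnings;

my ($match, $before, $after, $last_paren, $recent);

if ('say hello world now' =~ /(hello) (\w+)/) {
    $match      = $&;     # the entire matched string
    $before     = $`;     # everything before the match
    $after      = $';     # everything after the match
    $last_paren = $+;     # what the last bracket matched
    $recent     = $^N;    # the most recently closed group
}
```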

The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)

B<NOTE>: failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
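A short sketch of that behaviour, with a made-up string: the second
match fails, so C<$1> still holds the value set by the first.

```perl
use strict;
use warnings;

my $str = 'version 5.8';
my $kept;

if ($str =~ /(\d+)/) {          # succeeds and sets $1
    $str =~ /(\d+)-beta/;       # fails, so $1 is left untouched
    $kept = $1;
}
```

The same property means that reading C<$1> without first checking
whether the match succeeded may silently pick up a value from an earlier
match.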

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match.  This may substantially slow your program.  Perl
uses the same mechanism to produce $1, $2, etc, so you also pay a
price for each pattern that contains capturing parentheses.  (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.)  But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized.  So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.  As of 5.005, C<$&> is not so costly as the
other two.

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>.  Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric.  So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter.  This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results.  If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>.  The syntax is a
pair of parentheses with a question mark as the first thing within
the parentheses.  The character after the question mark indicates
the extension.

The stability of these extensions varies widely.  Some have been
part of the core language for many years.  Others are experimental
and may change without warning or be completely removed.  Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology...

=over 10

=item C<(?#text)>

A comment.  The text is ignored.  If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?imsx-imsx)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere.  Consider the case where some patterns want to be case
sensitive and some do not: the case insensitive ones merely need to
include C<(?i)> at the front of the pattern.  For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \1

will match a repeated (I<including the case>!) word C<blah> in any
case, assuming the C<x> modifier, and no C<i> modifier outside this
group.
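Both points can be sketched with a few made-up strings.  The first part
lets each pattern in a list decide its own case sensitivity; the second
shows that the backreference outside the group is compared with the
captured text exactly, case included.

```perl
use strict;
use warnings;

# Hypothetical patterns, as they might arrive from a configuration file.
my @patterns = ('(?i)foobar', 'FooBar');
my @results  = map { 'FOOBAR' =~ /$_/ ? 1 : 0 } @patterns;

# (?i) applies only inside the group; \1 must repeat the same case.
my $same_case  = 'BLAH BLAH' =~ /( (?i) blah ) \s+ \1/x ? 1 : 0;
my $mixed_case = 'Blah blah' =~ /( (?i) blah ) \s+ \1/x ? 1 : 0;
```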
504*0Sstevel@tonic-gate
505*0Sstevel@tonic-gate=item C<(?:pattern)>
506*0Sstevel@tonic-gate
507*0Sstevel@tonic-gate=item C<(?imsx-imsx:pattern)>
508*0Sstevel@tonic-gate
509*0Sstevel@tonic-gateThis is for clustering, not capturing; it groups subexpressions like
510*0Sstevel@tonic-gate"()", but doesn't make backreferences as "()" does.  So
511*0Sstevel@tonic-gate
512*0Sstevel@tonic-gate    @fields = split(/\b(?:a|b|c)\b/)
513*0Sstevel@tonic-gate
514*0Sstevel@tonic-gateis like
515*0Sstevel@tonic-gate
516*0Sstevel@tonic-gate    @fields = split(/\b(a|b|c)\b/)
517*0Sstevel@tonic-gate
518*0Sstevel@tonic-gatebut doesn't spit out extra fields.  It's also cheaper not to capture
519*0Sstevel@tonic-gatecharacters if you don't need to.

Any letters between C<?> and C<:> act as flag modifiers as with
C<(?imsx-imsx)>.  For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i
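
A short sketch of how the flags apply in that pattern, using invented test
strings: inside the cluster matching is case sensitive and C<.> may cross a
newline, while the trailing C<million> still honours the outer C</i>.

```perl
use strict;
use warnings;

my $re = qr/(?s-i:more.*than).*million/i;

# Lower-case "more ... than" spanning a newline, then upper-case MILLION.
my $hit  = "more\nthan a MILLION" =~ $re ? 1 : 0;

# Upper-case MORE is rejected because /i is switched off inside the cluster.
my $miss = "MORE than a million"  =~ $re ? 1 : 0;

print "hit: $hit, miss: $miss\n";
```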

=item C<(?=pattern)>

A zero-width positive look-ahead assertion.  For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>

A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar".  Note
however that look-ahead and look-behind are NOT the same thing.  You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want.  That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match.  You would have to do something like C</(?!foo)...bar/> for that.   We
say "like" because there's the case of your "bar" not having three characters
before it.  You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For look-behind see below.

=item C<(?<=pattern)>

A zero-width positive look-behind assertion.  For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

=item C<(?<!pattern)>

A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar".  Works
only for fixed-width look-behind.
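
The "bar not preceded by foo" problem from the look-ahead discussion can be
used to compare the approaches side by side.  This is only a sketch; the
three test strings and the C<%result> hash are assumptions made for the
illustration.

```perl
use strict;
use warnings;

my %result;
for my $str ('foobar', 'bazbar', 'bar') {
    $result{$str} = {
        # Look-ahead placed before "bar" only inspects "bar" itself.
        ahead      => ($str =~ /(?!foo)bar/                 ? 1 : 0),
        # Fixed-width negative look-behind inspects the preceding text.
        behind     => ($str =~ /(?<!foo)bar/                ? 1 : 0),
        # The look-ahead workaround quoted in the text.
        workaround => ($str =~ /(?:(?!foo)...|^.{0,2})bar/  ? 1 : 0),
    };
    printf "%-7s ahead=%d behind=%d workaround=%d\n", $str,
        @{ $result{$str} }{qw(ahead behind workaround)};
}
```

Only the look-behind and the workaround reject "foobar"; the plain
look-ahead accepts all three strings.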

=item C<(?{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

This zero-width assertion evaluates any embedded Perl code.  It
always succeeds, and its C<code> is not interpolated.  Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:

  $_ = "The brown fox jumps over the lazy dog";
  /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
  print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to find
the current position of matching within this string.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

  $_ = 'a' x 8;
  m<
     (?{ $cnt = 0 })                    # Initialize $cnt.
     (
       a
       (?{
           local $cnt = $cnt + 1;       # Update $cnt, backtracking-safe.
       })
     )*
     aaaa
     (?{ $res = $cnt })                 # On success copy to non-localized
                                        # location.
   >x;

will set C<$res = 4>.  Note that after the match, C<$cnt> returns to the
globally introduced value, because the scopes that restrict C<local>
operators are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch.  If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>.  This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
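
As a rough sketch of passing a value from one C<(?{ code })> block to the
next through C<$^R>, the pattern below accumulates the value of a string of
binary digits.  The variable C<$value> and the input string are assumptions
made for this illustration; the comments inside the pattern deliberately
avoid C<$> signs, since variables are interpolated even in C</x> comments.

```perl
use strict;
use warnings;

our $value;   # package variable that receives the final result

# The return value of each block becomes the running value read by the next.
my $ok = '1101' =~ m{
    \A (?{ 0 })                  # seed the running value with zero
    (?:
        1 (?{ $^R * 2 + 1 })     # shift in a one bit
      |
        0 (?{ $^R * 2 })         # shift in a zero bit
    )+
    \z (?{ $value = $^R })       # copy the result out on success
}x;

print "value = $value\n" if $ok;
```

Copying the result into an ordinary variable in a final block avoids
depending on what C<$^R> holds once the match has finished.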

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).

This restriction is because of the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns.  For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern.  If
you turn on the C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
compartment.  See L<perlsec> for details about both these mechanisms.

=item C<(??{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
A simplified version of the syntax may be introduced for commonly
used idioms.

This is a "postponed" regular subexpression.  The C<code> is evaluated
at run time, at the moment this subexpression may match.  The result
of evaluation is considered as a regular expression and matched as
if it were inserted instead of this construct.

The C<code> is not interpolated.  As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

  $re = qr{
             \(
             (?:
                (?> [^()]+ )    # Non-parens without backtracking
              |
                (??{ $re })     # Group with matching parens
             )*
             \)
          }x;
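
A hedged usage sketch of that pattern follows.  It declares C<$re> with
C<our> so that the code block can refer to the pattern itself under
C<use strict>; the test strings and the result variables are assumptions
made for the illustration.

```perl
use strict;
use warnings;

our $re;   # package variable, so the postponed block can see the pattern
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # non-parens without backtracking
      |
        (??{ $re })     # nested group with matching parens
    )*
    \)
}x;

# Extract the first balanced group from a longer string.
my ($group) = 'f(a(b)c)(d' =~ /($re)/;

# Whole-string checks for balanced and unbalanced input.
my $balanced   = '(a(b)c)' =~ /\A$re\z/ ? 1 : 0;
my $unbalanced = '(()'     =~ /\A$re\z/ ? 1 : 0;

print "group = $group, balanced = $balanced, unbalanced = $unbalanced\n";
```

Interpolating C<$re> into the outer patterns is permitted here because it
holds the result of C<qr//>.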

=item C<< (?>pattern) >>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>.  This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L<"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.

For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<a> at the beginning of string, leaving no C<a> for
C<ab> to match.  In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L<"Backtracking">).  In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.

An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\1>.  The look-ahead matches the same substring as a
standalone C<pattern>, and the following C<\1> eats the matched string;
it therefore makes a zero-length assertion into an analogue of
C<< (?>...) >>.  (The difference between these two constructs is that the
second one uses a capturing group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
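
The three forms discussed here can be compared directly.  This is a small
sketch with an invented test string:

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary a* gives back one "a" so that the trailing "ab" can match.
my $plain     = $text =~ /^a*ab/         ? 1 : 0;

# The independent group keeps every "a", so "ab" can never match.
my $atomic    = $text =~ /^(?>a*)ab/     ? 1 : 0;

# Look-ahead plus backreference behaves like the independent group.
my $lookahead = $text =~ /^(?=(a*))\1ab/ ? 1 : 0;

print "plain=$plain atomic=$atomic lookahead=$lookahead\n";
```

Only the plain form matches.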

Consider this pattern:

    m{ \(
          (
            [^()]+              # x+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching parentheses
two levels deep or less.  However, if there is no such group, it
will take virtually forever on a long string.  That's because there
are so many different ways to split a long string into several
substrings.  This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern.  Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time.  This
exponential performance will make it appear that your program has
hung.  However, a tiny change to this pattern

    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a quarter
of the time when used on a similar string with 1000000 C<a>s.  Be aware,
however, that this pattern currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.

On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.

The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution.  Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace.  Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way.  The correct
answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into $1, one should use either
one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects
the above specification of comments.
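
To see the naive delimiter "give up" whitespace, compare it with the two
corrected forms on a comment that contains nothing but whitespace.  This is
a sketch only: the helper C<grab()> and the sample strings are hypothetical,
and the patterns are written without C</x> for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured comment text, or 'no match'.
sub grab {
    my ($re, $str) = @_;
    return $str =~ $re ? "<$1>" : 'no match';
}

my $blank = "#   ";        # delimiter followed by whitespace only
my $real  = "# \tnote";    # delimiter, whitespace, then text

my $naive    = qr/#[ \t]*(.+)/;         # may hand whitespace over to (.+)
my $atomic   = qr/(?>#[ \t]*)(.+)/;     # never gives whitespace back
my $nonblank = qr/#[ \t]*([^ \t].*)/;   # text must start with a non-blank

for my $re ($naive, $atomic, $nonblank) {
    printf "%-24s blank: %-9s real: %s\n",
        $re, grab($re, $blank), grab($re, $real);
}
```

The naive form reports a one-space "comment" for the blank line, while both
corrected forms reject it; all three capture C<note> from the real comment.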

=item C<(?(condition)yes-pattern|no-pattern)>

=item C<(?(condition)yes-pattern)>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

Conditional expression.  C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), or a look-ahead, look-behind, or evaluate zero-width assertion.

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.
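
A brief sketch of the conditional in use.  The C<\A> and C<\z> anchors are
an addition for this illustration, so that a stray parenthesis cannot simply
be skipped over; the test strings are invented.

```perl
use strict;
use warnings;

# The closing parenthesis is required only if the opening one matched.
my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %ok;
for my $str ('abc', '(abc)', '(abc', 'abc)') {
    $ok{$str} = $str =~ $chunk ? 1 : 0;
    printf "%-6s %s\n", $str, $ok{$str} ? 'matches' : 'does not match';
}
```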

=back

=head2 Backtracking

NOTE: This section presents an abstract approximation of regular
expression behavior.  For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L<Combining pieces together>.

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>.  Backtracking is often optimized
internally, but the general principle outlined here is valid.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it.  So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking:  Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo".  However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match.  This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot.  Imagine you'd like to match
everything between "foo" and "bar".  Initially, you write something
like this:

    $_ =  "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

  got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar".  Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
  got <d is under the >

Here's another example: let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string. As C<\d*> can match on an empty string the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky.  It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success.  There may be 0, 1, or several different ways that the
definition might succeed against a particular string.  And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even
trickier.  Imagine you'd like to find a sequence of non-digits not
followed by "123".  You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {              # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping.  It
claims that there is no 123 in the string.  Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123' ;
    $y = 'ABC445' ;

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1.  The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not.  What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?"  If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC".  Then it will
try to match C<(?!123)> with "123", which fails.  But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time.  Now there's indeed something following "AB" that is not
"123".  It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
and by something that's not "123".  Remember that the look-aheads
are zero-width expressions--they only look, but don't consume any
of the string in their match.  So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions:  C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously.  The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar.  C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.

B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try to match.  For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<*>'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space.  Moreover, these internal optimizations are not
always applicable.  For example, if you put C<{0,5}> instead of C<*>
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L<C<< (?>pattern) >>>).  Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant.  For an example
where side-effects of look-ahead I<might> have influenced the
following match, see L<C<< (?>pattern) >>>.

=head2 Version 8 Regular Expressions

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above.  You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\").  A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.

You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any one character from the list.  If the
first character after the "[" is "^", the class matches any character not
in the list.  Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive.  If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash.  "-" is also taken literally when it is
at the end of the list, just before the closing "]".  (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC
based coded character sets.)  Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, that's not a range; the "-" is understood literally.
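
The equivalence of the three-character classes can be checked with a short
sketch.  The probe characters in C<@chars> and the C<%members> hash are
assumptions made for the illustration.

```perl
use strict;
use warnings;

my @chars = ('a', '-', 'z', 'm');   # probe characters
my %members;

# For each class, collect the probe characters it accepts.
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    $members{$class} = join '', grep { /\A$class\z/ } @chars;
    printf "%-7s accepts: %s\n", $class, $members{$class};
}
```

The first three classes accept C<a>, C<-> and C<z> but not C<m>; the range
C<[a-z]> accepts C<m> and rejects the hyphen.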

Note also that the whole range idea is rather unportable between
character sets--and even within a character set, ranges may produce results
you probably didn't expect.  A sound principle is to use only ranges
that begin and end at letters of the same case ([a-e],
[A-E]), or at digits ([0-9]).  Anything else is unsafe.  If in doubt,
spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose coded character set value
is I<nnn>.  Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose numeric value is I<nn>.  The expression \cI<x>
matches the character control-I<x>.  Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).
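
The escapes above can be tried out with a short sketch like the one
below.  It assumes an ASCII-based platform, where "A" has code 65; on
EBCDIC the numeric values differ.  The variable names are illustrative.

    use strict;
    use warnings;

    # Assumption: ASCII-based character set.
    my $octal   = "A"    =~ /^\101$/ ? 1 : 0;   # octal 101 is 65
    my $hex     = "A"    =~ /^\x41$/ ? 1 : 0;   # hex 41 is 65
    my $control = "\x01" =~ /^\cA$/  ? 1 : 0;   # control-A is code 1
    my $tab     = "\x09" =~ /^\t$/   ? 1 : 0;   # tab is code 9

    # "." does not match a newline unless /s is in effect.
    my $dot_plain = "a\nb" =~ /^a.b$/  ? 1 : 0;
    my $dot_s     = "a\nb" =~ /^a.b$/s ? 1 : 0;

    print "octal=$octal hex=$hex control=$control tab=$tab\n";
    print "dot without /s: $dot_plain, with /s: $dot_s\n";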

You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>).  The
first alternative includes everything from the last pattern delimiter
("(", "[", or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
pattern delimiter.  That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.
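
To illustrate where alternatives start and end, here is a small sketch
(the sample words are made up) that compares an ungrouped alternation
with a grouped one when anchors are involved:

    use strict;
    use warnings;

    # Without grouping, "^" belongs to the first alternative only
    # and "$" to the last one only.
    my $loose   = qr/^fee|fie|foe$/;
    # With grouping, both anchors apply to every alternative.
    my $grouped = qr/^(?:fee|fie|foe)$/;

    my %result;
    for my $word ('fie', 'feet', 'fief') {
        $result{$word} = [ $word =~ $loose   ? 1 : 0,
                           $word =~ $grouped ? 1 : 0 ];
        print "$word: loose=$result{$word}[0] grouped=$result{$word}[1]\n";
    }

Only "fie" satisfies the grouped pattern, while the loose pattern also
accepts "feet" (through C<^fee>) and "fief" (through the unanchored
C<fie>).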

Alternatives are tried from left to right, so the first
alternative for which the entire expression matches is the one that
is chosen.  This means that alternatives are not necessarily greedy.  For
example, when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string.  (This might not seem important, but it
matters when you are capturing matched text using parentheses.)
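
A short sketch of this left-to-right behavior, using the "barefoot"
example from the text (the variable names are illustrative):

    use strict;
    use warnings;

    # "foo" is tried first and lets the whole pattern succeed.
    my ($first)    = "barefoot" =~ /(foo|foot)/;
    # Reordering the alternatives changes what is captured.
    my ($longer)   = "barefoot" =~ /(foot|foo)/;
    # With a trailing "$", "foo" no longer lets the entire expression
    # match, so the second alternative is the one chosen.
    my ($anchored) = "barefoot" =~ /(foo|foot)$/;

    print "$first $longer $anchored\n";   # foo foot foot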

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
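
The following sketch, with made-up candidate strings, shows that the
bracketed form accepts single characters only, including the bar itself:

    use strict;
    use warnings;

    my @candidates = ('f', 'e', 'i', 'o', '|', 'fee', 'x');
    # Inside brackets the "|" is just another member of the class.
    my @accepted = grep { /^[fee|fie|foe]$/ } @candidates;

    print "@accepted\n";   # f e i o |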

Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n>.  Subpatterns are numbered based on the left to right order
of their opening parenthesis.  A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern.  Therefore, C<(0|0x)\d*\s\1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.
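
As an illustration, the sketch below runs the pattern from the text and
adds two made-up examples, one for subpattern numbering and one for a
doubled word:

    use strict;
    use warnings;

    my $pair = qr/(0|0x)\d*\s\1\d*/;
    my $same_prefix  = "0x1234 0x4321" =~ $pair ? 1 : 0;
    my $mixed_prefix = "0x1234 01234"  =~ $pair ? 1 : 0;

    # Groups are numbered by their opening parenthesis, so the
    # outer group is 1 and the inner group is 2.
    my ($outer, $inner) = "2024-06" =~ /((\d+)-\d+)/;

    # \1 must repeat the text that group 1 actually matched.
    my ($doubled) = "it is is fine" =~ /\b(\w+) \1\b/;

    print "same=$same_prefix mixed=$mixed_prefix\n";
    print "outer=$outer inner=$inner doubled=$doubled\n";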

=head2 Warning on \1 vs $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the RHS of a substitute to avoid shocking the
B<sed> addicts, but it's a dirty habit to get into.  That's because in
PerlThink, the righthand side of an C<s///> is a double-quoted string.  C<\1> in
the usual double-quoted string means a control-A.  The customary Unix
meaning of C<\1> is kludged in for C<s///>.  However, if you get into the habit
of doing that, you get yourself into trouble if you then add an C</e>
modifier.

    s/(\d+)/ \1 + 1 /eg;        # causes warning under -w

Or suppose you try to write

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>.  The operation of interpolation should not be confused
with the operation of matching a backreference.  Certainly they mean two
different things on the I<left> side of the C<s///>.
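
A minimal sketch of the recommended spelling, using a made-up input
string, might look like this:

    use strict;
    use warnings;

    my $price = "cost 5";

    # With /e the replacement is Perl code, so use $1, not \1.
    (my $bumped = $price) =~ s/(\d+)/$1 + 1/e;

    # Braces separate the capture variable from the digits that follow;
    # a bare $1000 would be read as capture group 1000.
    (my $scaled = $price) =~ s/(\d+)/${1}000/;

    print "$bumped\n";   # cost 6
    print "$scaled\n";   # cost 5000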

=head2 Repeated patterns matching zero-length substring

B<WARNING>: Difficult material (and prose) ahead.  This section needs a rewrite.

Regular expressions provide a terse and powerful programming language.  As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> can match at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier.  Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings.  Here's a simple example:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>.  The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring.  Thus

   m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

   m{   (?: NON_ZERO_LENGTH )*
      |
        (?: ZERO_LENGTH )?
    }x;
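
As a rough check of the lower-level rule, the sketch below runs the
pattern from the start of this section and records where each match
begins and ends.  The helper name C<span> is hypothetical.

    use strict;
    use warnings;

    # Return "start-end" offsets of the match, or "none".
    sub span {
        my ($string) = @_;
        return 'none' unless $string =~ m{ ( o? )* }x;
        return "$-[0]-$+[0]";
    }

    # The match ends immediately: o? matches the empty string
    # before "f" and the loop is interrupted.
    my $foo = span('foo');
    # Two non-empty iterations, then an empty one that stops the loop.
    my $oof = span('oof');

    print "foo: $foo, oof: $oof\n";   # foo: 0-0, oof: 0-2

Both calls return instead of looping forever, which is the point of the
rule.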

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length.  To break the loop, the match
that follows a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>.  At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>.  Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.
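
The higher-level rules can be observed with a short sketch such as the
following, which reuses the patterns from this section (the variable
names are illustrative):

    use strict;
    use warnings;

    # List-context //g: an empty match before "f", the two "o"s,
    # then an empty match at the end of the string.
    my @matches = ( 'foo' =~ m{ o? }xg );
    print scalar(@matches), " matches: ",
          join(',', map { "<$_>" } @matches), "\n";

    # The example from the text: zero-length and one-character
    # matches alternate.
    (my $marked = 'bar') =~ s/\w??/<$&>/g;
    print "$marked\n";    # <><b><><a><><r><>

    # split with an empty pattern yields the individual characters.
    my @chars = split //, 'bar';
    print "@chars\n";     # b a r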

=head2 Combining pieces together

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string.  However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">?  One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse".  All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen.  This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible.  This section describes the
notion of better/worse for combining operators.  In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

The ordering of two matches for C<S> is the same as for C<S> alone.
Similarly for two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered.  (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>

The ordering is the same as for the regular expression which is
the result of EXPR.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined.  The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.
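
The ordering rules above can be spot-checked with a sketch like the one
below.  The sample strings and variable names are made up for
illustration.

    use strict;
    use warnings;

    # S|T: the first alternative that can match is better.
    my ($alt_a)  = "abc" =~ /(a|ab)/;        # a
    my ($alt_ab) = "abc" =~ /(ab|a)/;        # ab

    # S{min,max} prefers more repetitions, S{min,max}? fewer.
    my ($greedy) = "aaa" =~ /(a{1,2})/;      # aa
    my ($lazy)   = "aaa" =~ /(a{1,2}?)/;     # a

    # ST: the better match for S wins as long as T can still match.
    my ($s, $t) = "abcd" =~ /(a|ab)(c|bcd)/; # a, bcd

    # An earlier position beats a better-ranked alternative.
    my ($earliest) = "xab" =~ /(b|ab)/;      # ab

    print "$alt_a $alt_ab $greedy $lazy $s+$t $earliest\n";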

=head2 Creating custom RE engines

Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters.  Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version.  We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    my %rules = ( '\\' => '\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions.  For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;
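
Independently of the C<customre> module, the lookaround expression that
C<\Y|> stands for can be tried directly.  The sketch below uses a
made-up sample string and collects the offsets at which the boundary
matches:

    use strict;
    use warnings;

    my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;
    my $text     = "ab cd";

    # Every match is zero-length, so //g moves on one position
    # after each success instead of looping forever.
    my @offsets;
    while ($text =~ /$boundary/g) {
        push @offsets, pos($text);
    }

    print "@offsets\n";   # 0 2 3 5

The reported offsets are the start and end of each run of
non-whitespace characters.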

=head1 BUGS

This document varies from difficult to understand to completely
and utterly opaque.  The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.