=head1 NAME

perlreapi - Perl regular expression plugin interface

=head1 DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging in and using
regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the
following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_
                         const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_
                         REGEXP * const rx,
                         char* stringarg,
                         char* strend, char* strbeg,
                         SSize_t minend, SV* sv,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_
                           REGEXP * const rx, SV *sv,
                           const char * const strbeg,
                           char *strpos, char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_
                                         REGEXP * const rx,
                                         const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_
                               REGEXP * const rx,
                               SV * const key,
                               SV * const value,
                               U32 flags);
        SV*     (*named_buff_iter) (pTHX_
                                    REGEXP * const rx,
                                    const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;

=for apidoc_section $regexp
=for apidoc Ay||regexp_engine

When a regexp is compiled, its C<engine> field is then set to point at
the appropriate structure, so that when it needs to be used Perl can find
the right routines to do so.

In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
structures.  When compiling, the C<comp> method is executed, and the
resulting C<regexp> structure's engine field is expected to point back at
the same structure.

The C<pTHX_> symbol in the definition is a macro used by Perl under threading
to provide an extra argument to the routine holding a pointer back to
the interpreter that is executing the regexp.  So under threading all
routines get an extra argument.

=head1 Callbacks

=head2 comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in C<pattern> using the given C<flags> and
return a pointer to a prepared C<REGEXP> structure that can perform
the match.  See L</The REGEXP structure> below for an explanation of
the individual fields in the REGEXP struct.

The C<pattern> parameter is the scalar that was used as the
pattern.  Previous versions of Perl would pass two C<char*> indicating
the start and end of the stringified pattern; the following snippet can
be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern, it's possible to implement
an engine that does something with an array (C<< "ook" =~ [ qw/ eek
hlagh / ] >>) or with the non-stringified form of a compiled regular
expression (C<< "ook" =~ qr/eek/ >>).  Perl's own engine will always
stringify everything using the snippet above, but that doesn't mean
other engines have to.

The C<flags> parameter is a bitfield which indicates which of the
C<msixpn> flags the regex was compiled with.  It also contains
additional info, such as if C<use locale> is in effect.

The C<eogc> flags are stripped out before being passed to the comp
routine.  The regex engine does not need to know if any of these
are set, as those flags should only affect what Perl does with the
pattern and its match variables, not how it gets compiled and
executed.

By the time the comp callback is called, some of these flags have
already had effect (noted below where applicable).  However most of
their effect occurs after the comp callback has run, in routines that
read the C<< rx->extflags >> field which it populates.

In general the flags should be preserved in C<< rx->extflags >> after
compilation, although the regex engine might want to add or delete
some of them to invoke or disable some special behavior in Perl.  The
flags along with any special behavior they cause are documented below:

The pattern modifiers:

=over 4

=item C</m> - RXf_PMf_MULTILINE

If this is in C<< rx->extflags >> it will be passed to
C<Perl_fbm_instr> by C<pp_split> which will treat the subject string
as a multi-line string.

=for apidoc Amnh||RXf_PMf_MULTILINE
=for apidoc_item  RXf_PMf_SINGLELINE
=for apidoc_item  RXf_PMf_FOLD
=for apidoc_item  RXf_PMf_EXTENDED
=for apidoc_item  RXf_PMf_KEEPCOPY

=item C</s> - RXf_PMf_SINGLELINE

=item C</i> - RXf_PMf_FOLD

=item C</x> - RXf_PMf_EXTENDED

If present on a regex, C<"#"> comments will be handled differently by the
tokenizer in some cases.

TODO: Document those cases.

=item C</p> - RXf_PMf_KEEPCOPY

TODO: Document this

=item Character set

The character set rules are determined by an enum that is contained
in this field.  This is still experimental and subject to change, but
the current interface returns the rules by use of the in-line function
C<get_regex_charset(const U32 flags)>.  The only currently documented
value returned from it is REGEX_LOCALE_CHARSET, which is set if
C<use locale> is in effect. If present in C<< rx->extflags >>,
C<split> will use the locale dependent definition of whitespace
when RXf_SKIPWHITE or RXf_WHITE is in effect.  ASCII whitespace
is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal
macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use
locale>.

=for apidoc Amnh||REGEX_LOCALE_CHARSET

=back

Additional flags:

=over 4

=item RXf_SPLIT

This flag was removed in perl 5.18.0.  C<split ' '> is now special-cased
solely in the parser.  RXf_SPLIT is still #defined, so you can test for it.
This is how it used to work:

If C<split> is invoked as C<split ' '> or with no arguments (which
really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will
set this flag.  The regex engine can then check for it and set the
SKIPWHITE and WHITE extflags.  To do this, the Perl engine does:

    if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
        r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

=back

These flags can be set during compilation to enable optimizations in
the C<split> operator.

=for apidoc Amnh||RXf_SPLIT
=for apidoc_item  RXf_SKIPWHITE
=for apidoc_item  RXf_START_ONLY
=for apidoc_item  RXf_WHITE
=for apidoc_item  RXf_NULL
=for apidoc_item  RXf_NO_INPLACE_SUBST

=over 4

=item RXf_SKIPWHITE

This flag was removed in perl 5.18.0.  It is still #defined, so you can
set it, but doing so will have no effect.  This is how it used to work:

If the flag is present in C<< rx->extflags >> C<split> will delete
whitespace from the start of the subject string before it's operated
on.  What is considered whitespace depends on if the subject is a
UTF-8 string and if the C<RXf_PMf_LOCALE> flag is set.

If RXf_WHITE is set in addition to this flag, C<split> will behave like
C<split " "> under the Perl engine.

=item RXf_START_ONLY

Tells the split operator to split the target string on newlines
(C<\n>) without invoking the regex engine.

Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp
== '^'>), even under C</^/s>; see L<split|perlfunc/split>.  Of course a
different regex engine might want to use the same optimizations
with a different syntax.

=item RXf_WHITE

Tells the split operator to split the target string on whitespace
without invoking the regex engine.  The definition of whitespace varies
depending on if the target string is a UTF-8 string and on
if RXf_PMf_LOCALE is set.

Perl's engine sets this flag if the pattern is C<\s+>.

=item RXf_NULL

Tells the split operator to split the target string on
characters.  The definition of character varies depending on if
the target string is a UTF-8 string.

Perl's engine sets this flag on empty patterns; this optimization
makes C<split //> much faster than it would otherwise be.  It's even
faster than C<unpack>.

=item RXf_NO_INPLACE_SUBST

Added in perl 5.18.0, this flag indicates that a regular expression might
perform an operation that would interfere with inplace substitution. For
instance it might contain lookbehind, or assign to non-magical variables
(such as $REGMARK and $REGERROR) during matching.  C<s///> will skip
certain optimisations when this is set.

=back

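The pattern checks that Perl's engine is described as making for the
C<split> shortcuts can be sketched in plain C.  This is only an
illustration: the C<EX_*> values and the C<ex_split_flags> helper are
hypothetical stand-ins, not the real C<RXf_*> constants from
F<regexp.h>, and a real engine would set the corresponding bits in
C<< rx->extflags >> from its C<comp> callback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for RXf_START_ONLY, RXf_WHITE and RXf_NULL;
 * the real values live in regexp.h. */
#define EX_START_ONLY (1u << 0)
#define EX_WHITE      (1u << 1)
#define EX_NULL       (1u << 2)

/* Decide which split shortcut applies to a pattern of plen bytes,
 * following the checks described for Perl's own engine. */
unsigned ex_split_flags(const char *exp, size_t plen)
{
    assert(exp != NULL || plen == 0);

    if (plen == 0)
        return EX_NULL;                  /* split //    */
    if (plen == 1 && exp[0] == '^')
        return EX_START_ONLY;            /* split /^/   */
    if (plen == 3 && memcmp(exp, "\\s+", 3) == 0)
        return EX_WHITE;                 /* split /\s+/ */
    return 0;                            /* no shortcut: run the engine */
}
```

An engine with a different pattern syntax could apply the same idea
with its own spelling of "start of line" or "whitespace".
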
=head2 exec

    I32 exec(pTHX_ REGEXP * const rx,
             char *stringarg, char* strend, char* strbeg,
             SSize_t minend, SV* sv,
             void* data, U32 flags);

Execute a regexp. The arguments are

=over 4

=item rx

The regular expression to execute.

=item sv

This is the SV to be matched against.  Note that the
actual char array to be matched against is supplied by the arguments
described below; the SV is just used to determine UTF8ness, C<pos()> etc.

=item strbeg

Pointer to the physical start of the string.

=item strend

Pointer to the character following the physical end of the string (i.e.
the C<\0>, if any).

=item stringarg

Pointer to the position in the string where matching should start; it might
not be equal to C<strbeg> (for example in a later iteration of C</.../g>).

=item minend

Minimum length of string (measured in bytes from C<stringarg>) that must
match; if the engine reaches the end of the match but hasn't reached this
position in the string, it should fail.

=item data

Optimisation data; subject to change.

=item flags

Optimisation flags; subject to change.

=back

=head2 intuit

    char* intuit(pTHX_
                REGEXP * const rx,
                SV *sv,
                const char * const strbeg,
                char *strpos,
                char *strend,
                const U32 flags,
                struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted,
or possibly if the regex engine should not be run because the
pattern can't match.  This is called, as appropriate, by the core,
depending on the values of the C<extflags> member of the C<regexp>
structure.

Arguments:

    rx:     the regex to match against
    sv:     the SV being matched: only used for utf8 flag; the string
            itself is accessed via the pointers below. Note that on
            something like an overloaded SV, SvPOK(sv) may be false
            and the string pointers may point to something unrelated to
            the SV itself.
    strbeg: real beginning of string
    strpos: the point in the string at which to begin matching
    strend: pointer to the byte following the last char of the string
    flags:  currently unused; set to 0
    data:   currently unused; set to NULL

=head2 checkstr

    SV* checkstr(pTHX_ REGEXP * const rx);

Return a SV containing a string that must appear in the pattern. Used
by C<split> for optimising matches.

=head2 free

    void free(pTHX_ REGEXP * const rx);

Called by Perl when it is freeing a regexp pattern so that the engine
can release any resources pointed to by the C<pprivate> member of the
C<regexp> structure.  This is only responsible for freeing private data;
Perl will handle releasing anything else contained in the C<regexp> structure.

=head2 Numbered capture callbacks

Called to get/set the value of C<$`>, C<$'>, C<$&> and their named
equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the
numbered capture groups (C<$1>, C<$2>, ...).

The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so
forth, and have these symbolic values for the special variables:

    ${^PREMATCH}  RX_BUFF_IDX_CARET_PREMATCH
    ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH
    ${^MATCH}     RX_BUFF_IDX_CARET_FULLMATCH
    $`            RX_BUFF_IDX_PREMATCH
    $'            RX_BUFF_IDX_POSTMATCH
    $&            RX_BUFF_IDX_FULLMATCH

=for apidoc Amnh||RX_BUFF_IDX_CARET_FULLMATCH
=for apidoc_item  RX_BUFF_IDX_CARET_POSTMATCH
=for apidoc_item  RX_BUFF_IDX_CARET_PREMATCH
=for apidoc_item  RX_BUFF_IDX_FULLMATCH
=for apidoc_item  RX_BUFF_IDX_POSTMATCH
=for apidoc_item  RX_BUFF_IDX_PREMATCH

Note that in Perl 5.17.3 and earlier, the last three constants were also
used for the caret variants of the variables.

The names have been chosen by analogy with L<Tie::Scalar> method
names, with an additional B<LENGTH> callback for efficiency.  However
named capture variables are currently not tied internally but
implemented via magic.

=head3 numbered_buff_FETCH

    void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
                             SV * const sv);

Fetch a specified numbered capture.  C<sv> should be set to the scalar
to return.  The scalar is passed as an argument rather than being
returned from the function because when it's called Perl already has a
scalar to store the value; creating another one would be
redundant.  The scalar can be set with C<sv_setsv>, C<sv_setpvn> and
friends, see L<perlapi>.

This callback is where Perl untaints its own capture variables under
taint mode (see L<perlsec>).  See the C<Perl_reg_numbered_buff_fetch>
function in F<regcomp.c> for how to untaint capture variables if
that's something you'd like your engine to do as well.

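As a rough model of what a FETCH implementation has to work out, the
following self-contained C sketch maps a C<paren> value to a byte range
in the subject buffer.  Everything here is a hypothetical stand-in: the
C<ex_match> type only mimics the C<subbeg>, C<sublen>, C<nparens> and
C<offs> members of the C<regexp> structure described later in this
document, the C<EX_IDX_*> values are illustrative rather than the real
C<RX_BUFF_IDX_*> constants, and a real callback would finish by copying
the range into C<sv> (for instance with C<sv_setpvn>).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for RX_BUFF_IDX_PREMATCH and friends. */
enum { EX_IDX_PREMATCH = -2, EX_IDX_POSTMATCH = -1, EX_IDX_FULLMATCH = 0 };

typedef struct { long start; long end; } ex_paren_pair;

typedef struct {
    const char    *subbeg;   /* subject buffer of the last match */
    long           sublen;   /* its length in bytes */
    long           nparens;  /* number of capture groups */
    ex_paren_pair *offs;     /* offs[0] describes the whole match */
} ex_match;

/* Store the byte range selected by paren in *s and *len.
 * Returns 0 if the variable is undefined (nothing matched). */
int ex_fetch(const ex_match *rx, int paren, const char **s, long *len)
{
    long from, to;

    assert(rx != NULL && s != NULL && len != NULL);

    if (rx->offs[0].start == -1)
        return 0;                                   /* no match at all */
    if (paren == EX_IDX_PREMATCH) {                 /* $` */
        from = 0;
        to   = rx->offs[0].start;
    } else if (paren == EX_IDX_POSTMATCH) {         /* $' */
        from = rx->offs[0].end;
        to   = rx->sublen;
    } else if (paren >= 0 && paren <= rx->nparens) { /* $&, $1, $2, ... */
        from = rx->offs[paren].start;
        to   = rx->offs[paren].end;
        if (from == -1 || to == -1)
            return 0;                               /* group did not match */
    } else {
        return 0;                                   /* no such group */
    }
    *s   = rx->subbeg + from;
    *len = to - from;
    return 1;
}
```
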
=head3 numbered_buff_STORE

    void    (*numbered_buff_STORE) (pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value);

Set the value of a numbered capture variable.  C<value> is the scalar
that is to be used as the new value.  It's up to the engine to make
sure this is used as the new value (or reject it).

Example:

    if ("ook" =~ /(o*)/) {
        # 'paren' will be '1' and 'value' will be 'ee'
        $1 =~ tr/o/e/;
    }

Perl's own engine will croak on any attempt to modify the capture
variables.  To do this in another engine use the following callback
(copied from C<Perl_reg_numbered_buff_store>):

    void
    Example_reg_numbered_buff_store(pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value)
    {
        PERL_UNUSED_ARG(rx);
        PERL_UNUSED_ARG(paren);
        PERL_UNUSED_ARG(value);

        if (!PL_localizing)
            Perl_croak(aTHX_ PL_no_modify);
    }

Actually Perl will not I<always> croak in a statement that looks
like it would modify a numbered capture variable.  This is because the
STORE callback will not be called if Perl can determine that it
doesn't have to modify the value.  This is exactly how tied variables
behave in the same situation:

    package CaptureVar;
    use parent 'Tie::Scalar';

    sub TIESCALAR { bless [] }
    sub FETCH { undef }
    sub STORE { die "This doesn't get called" }

    package main;

    tie my $sv => "CaptureVar";
    $sv =~ y/a/b/;

Because C<$sv> is C<undef> when the C<y///> operator is applied to it,
the transliteration won't actually execute and the program won't
C<die>.  This is different to how 5.8 and earlier versions behaved
since the capture variables were READONLY variables then; now they'll
just die when assigned to in the default engine.

=head3 numbered_buff_LENGTH

    I32 numbered_buff_LENGTH (pTHX_
                              REGEXP * const rx,
                              const SV * const sv,
                              const I32 paren);

Get the C<length> of a capture variable.  There's a special callback
for this so that Perl doesn't have to do a FETCH and run C<length> on
the result.  Since the length is (in Perl's case) known from an offset
stored in C<< rx->offs >>, this is much more efficient:

    I32 s1  = rx->offs[paren].start;
    I32 s2  = rx->offs[paren].end;
    I32 len = s2 - s1;

This is a little bit more complex in the case of UTF-8, see what
C<Perl_reg_numbered_buff_length> does with
L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>.

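The reason UTF-8 complicates matters is that the offsets give a length
in bytes, while C<length> has to report characters.  The sketch below
is a simplified stand-in for that conversion, not Perl's
implementation: C<ex_utf8_chars> is a hypothetical helper that assumes
the bytes are already well-formed UTF-8 and simply counts the bytes
that start a character.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the n bytes starting at s, assuming s holds
 * well-formed UTF-8: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new character. */
size_t ex_utf8_chars(const unsigned char *s, size_t n)
{
    size_t chars = 0;
    size_t i;

    assert(s != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For a capture that spans bytes C<s1> to C<s2> of the subject buffer, the
character length under this model would be
C<ex_utf8_chars(subbeg + s1, s2 - s1)>.
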
=head2 Named capture callbacks

Called to get/set the value of C<%+> and C<%->, as well as by some
utility functions in L<re>.

There are two callbacks: C<named_buff> is called in all the cases the
FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks
would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the
same cases as FIRSTKEY and NEXTKEY.

The C<flags> parameter can be used to determine which of these
operations the callbacks should respond to.  The following flags are
currently defined:

Which L<Tie::Hash> operation is being performed from the Perl level on
C<%+> or C<%->, if any:

    RXapif_FETCH
    RXapif_STORE
    RXapif_DELETE
    RXapif_CLEAR
    RXapif_EXISTS
    RXapif_SCALAR
    RXapif_FIRSTKEY
    RXapif_NEXTKEY

=for apidoc Amnh ||RXapif_CLEAR
=for apidoc_item   RXapif_DELETE
=for apidoc_item   RXapif_EXISTS
=for apidoc_item   RXapif_FETCH
=for apidoc_item   RXapif_FIRSTKEY
=for apidoc_item   RXapif_NEXTKEY
=for apidoc_item   RXapif_SCALAR
=for apidoc_item   RXapif_STORE
=for apidoc_item   RXapif_ALL
=for apidoc_item   RXapif_ONE
=for apidoc_item   RXapif_REGNAME
=for apidoc_item   RXapif_REGNAMES
=for apidoc_item   RXapif_REGNAMES_COUNT

Whether C<%+> or C<%-> is being operated on, if either:

    RXapif_ONE /* %+ */
    RXapif_ALL /* %- */

Whether this is being called as C<re::regname>, C<re::regnames> or
C<re::regnames_count>, if any.  The first two will be combined with
C<RXapif_ONE> or C<RXapif_ALL>.

    RXapif_REGNAME
    RXapif_REGNAMES
    RXapif_REGNAMES_COUNT

Internally C<%+> and C<%-> are implemented with a real tied interface
via L<Tie::Hash::NamedCapture>.  The methods in that package will call
back into these functions.  However the usage of
L<Tie::Hash::NamedCapture> for this purpose might change in future
releases.  For instance this might be implemented by magic instead
(would need an extension to mgvtbl).

=head3 named_buff

    SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                           SV * const value, U32 flags);

=head3 named_buff_iter

    SV*     (*named_buff_iter) (pTHX_
                                REGEXP * const rx,
                                const SV * const lastkey,
                                const U32 flags);

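Since one callback serves several hash operations, an implementation
typically begins by inspecting C<flags>.  The following is a minimal
sketch of that dispatch under stated assumptions: the C<EX_*> bit values
and both helper functions are hypothetical, the real C<RXapif_*>
constants are defined in F<regexp.h>, and whether an engine answers or
refuses write operations is its own decision.

```c
/* Hypothetical bit values standing in for the RXapif_* constants. */
#define EX_FETCH   0x0001u
#define EX_STORE   0x0002u
#define EX_DELETE  0x0004u
#define EX_CLEAR   0x0008u
#define EX_EXISTS  0x0010u
#define EX_SCALAR  0x0020u
#define EX_ONE     0x0100u   /* %+ */
#define EX_ALL     0x0200u   /* %- */

typedef enum { EX_OP_READ, EX_OP_WRITE, EX_OP_UNKNOWN } ex_op_kind;

/* Classify a named_buff call: reads can be answered from the capture
 * data, writes would modify %+ or %- and may be refused. */
ex_op_kind ex_classify(unsigned flags)
{
    if (flags & (EX_FETCH | EX_EXISTS | EX_SCALAR))
        return EX_OP_READ;
    if (flags & (EX_STORE | EX_DELETE | EX_CLEAR))
        return EX_OP_WRITE;
    return EX_OP_UNKNOWN;
}

/* Nonzero if the call targets %- (all groups) rather than %+. */
int ex_wants_all(unsigned flags)
{
    return (flags & EX_ALL) != 0;
}
```
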
=head2 qr_package

    SV* qr_package(pTHX_ REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by C<ref
qr//>).  It is recommended that engines change this to their package
name for identification regardless of if they implement methods
on the object.

The package this method returns should also have the internal
C<Regexp> package in its C<@ISA>.  C<< qr//->isa("Regexp") >> should always
be true regardless of what engine is being used.

An example implementation might be:

    SV*
    Example_qr_package(pTHX_ REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with C<qr//> will be dispatched to the
package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the C<REGEXP> object from the scalar in an XS function use
the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP
Functions>.

    void meth(SV * rv)
    PPCODE:
        REGEXP * re = SvRX(rv);

=head2 dupe

    void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads.  This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the C<regexp> structure.  It will be called with the preconstructed new
C<regexp> structure as an argument; the C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which Perl will then use to
overwrite the field as passed to this routine).

This allows the engine to dupe its private data but also if necessary
modify the final structure if it really must.

On unthreaded builds this field doesn't exist.

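The contract described for C<dupe> (read the old private data through
C<pprivate>, build a copy, return it) can be modelled without any Perl
headers.  In this sketch the C<ex_private> and C<ex_dupe_rx> types and
the C<ex_dupe> function are hypothetical, the C<pTHX_> and
C<CLONE_PARAMS> arguments are left out, and plain C<malloc> stands in
for whatever allocator a real engine would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data an engine might hang off pprivate. */
typedef struct {
    char  *program;      /* compiled form, owned by this struct */
    size_t program_len;  /* its size in bytes */
} ex_private;

/* Stand-in for the part of the regexp structure that dupe touches. */
typedef struct {
    void *pprivate;
} ex_dupe_rx;

/* Deep-copy the old private data reachable through rx->pprivate and
 * return the copy; the caller stores it back into pprivate.
 * Returns NULL on allocation failure. */
void *ex_dupe(const ex_dupe_rx *rx)
{
    const ex_private *old;
    ex_private *copy;

    assert(rx != NULL && rx->pprivate != NULL);
    old = rx->pprivate;

    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    copy->program = malloc(old->program_len ? old->program_len : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    if (old->program_len)
        memcpy(copy->program, old->program, old->program_len);
    copy->program_len = old->program_len;
    return copy;
}
```

The copy shares no pointers with the original, so each thread can later
release its own private data in its C<free> callback.
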
=head2 op_comp

This is private to the Perl core and subject to change. Should be left
null.

=head1 The REGEXP structure

The REGEXP struct is defined in F<regexp.h>.
All regex engines must be able to
correctly build such a structure in their L</comp> routine.

=for apidoc Ayh||REGEXP

The REGEXP structure contains all the data that Perl needs to be aware of
to properly work with the regular expression.  It includes data about
optimisations that Perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as if the pattern is anchored in
some way, or what flags were used during the compile, or if the
program contains special constructs that Perl needs to be aware of.

In addition it contains two fields that are intended for the private
use of the regex engine that compiled the pattern.  These are the
C<intflags> and C<pprivate> members.  C<pprivate> is a void pointer to
an arbitrary structure, whose use and management is the responsibility
of the compiling engine.  Perl will never modify either of these
values.

    typedef struct regexp {
        /* what engine created this regexp? */
        const struct regexp_engine* engine;

        /* what re is this a lightweight copy of? */
        struct regexp* mother_re;

        /* Information about the match that the Perl core uses to manage
         * things */
        U32 extflags;   /* Flags used both externally and internally */
        I32 minlen;     /* minimum possible number of chars in
                           string to match */
        I32 minlenret;  /* minimum possible number of chars in $& */
        U32 gofs;       /* chars left of pos that we search from */

        /* substring data about strings that must appear
           in the final match, used for optimisations */
        struct reg_substr_data *substrs;

        U32 nparens;  /* number of capture groups */

        /* private engine specific data */
        U32 intflags;   /* Engine Specific Internal flags */
        void *pprivate; /* Data private to the regex engine which
                           created this object. */

        /* Data about the last/current match. These are modified during
         * matching */
        U32 lastparen;            /* highest close paren matched ($+) */
        U32 lastcloseparen;       /* last close paren matched ($^N) */
        regexp_paren_pair *offs;  /* Array of offsets for (@-) and
                                     (@+) */

        char *subbeg;  /* saved or original string so \digit works
                          forever. */
        SV_SAVED_COPY  /* If non-NULL, SV which is COW from original */
        I32 sublen;    /* Length of string pointed by subbeg */
        I32 suboffset; /* byte offset of subbeg from logical start of
                          str */
        I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */

        /* Information about the match that isn't often used */
        I32 prelen;           /* length of precomp */
        const char *precomp;  /* pre-compilation regular expression */

        char *wrapped;  /* wrapped version of the pattern */
        I32 wraplen;    /* length of wrapped */

        I32 seen_evals;   /* number of eval groups in the pattern - for
                             security checks */
        HV *paren_names;  /* Optional hash of paren names */

        /* Refcount of this regexp */
        I32 refcnt;             /* Refcount of this regexp */
    } regexp;

The fields are discussed in more detail below:

=head2 C<engine>

This field points at a C<regexp_engine> structure which contains pointers
to the subroutines that are to be used for performing a match.  It
is the compiling routine's responsibility to populate this field before
returning the regexp object.

Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>.  Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.

=for apidoc Amnh||SV_SAVED_COPY

=head2 C<mother_re>

TODO, see commit 28d8d7f41a.

=head2 C<extflags>

This will be used by Perl to see what flags the regexp was compiled
with.  It will normally be set to the value of the flags parameter by
the L<comp|/comp> callback.  See the L<comp|/comp> documentation for
valid flags.

=head2 C<minlen> C<minlenret>

The minimum string length (in characters) required for the pattern to match.
This is used to prune the search space by not bothering to match any closer
to the end of a string than would allow a match.  For instance there is no
point in even starting the regex engine if the minlen is 10 but the string
is only 5 characters long.  There is no way that the pattern can match.

C<minlenret> is the minimum length (in characters) of the string that would
be found in $& after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is
required to match but is not actually included in the matched content.  This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell if it can do in-place substitutions (these can
result in considerable speed-up).

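The pruning that C<minlen> allows amounts to a simple comparison.  The
two helpers below are hypothetical and only illustrate the arithmetic;
they are not taken from the Perl source, and they count characters, as
C<minlen> does.

```c
/* Nonzero if a match is still possible when chars_left characters
 * remain between the start position and the end of the string. */
int ex_can_match(long chars_left, long minlen)
{
    return chars_left >= minlen;
}

/* Number of start positions worth trying in a string of str_chars
 * characters: positions closer to the end than minlen cannot match. */
long ex_start_positions(long str_chars, long minlen)
{
    return str_chars >= minlen ? str_chars - minlen + 1 : 0;
}
```

With the figures from the text, C<ex_can_match(5, 10)> is false, so the
engine would not need to be started at all.
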
=head2 C<gofs>

Left offset from pos() to start match at.

=head2 C<substrs>

Substring data about strings that must appear in the final match.  This
is currently only used internally by Perl's engine, but might be
used in the future for all engines for optimisations.

=head2 C<nparens>, C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of: how many paren capture groups
there are in the pattern; which was the highest paren to be closed (see
L<perlvar/$+>); and which was the most recent paren to be closed (see
L<perlvar/$^N>).

=head2 C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=head2 C<pprivate>

A void* pointing to an engine-defined data structure.  The Perl engine uses
the C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a
custom engine should use something else.

=head2 C<offs>

An array of C<regexp_paren_pair> structures which define offsets into the
string being matched which correspond to the C<$&> and C<$1>, C<$2> etc.
captures.  The C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

=for apidoc Ayh||regexp_paren_pair

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture group did not match.
C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.

=head2 C<precomp> C<prelen>

Used for optimisations.  C<precomp> holds a copy of the pattern that
was compiled and C<prelen> its length.  When a new pattern is to be
compiled (such as inside a loop) the internal C<regcomp> operator
checks if the last compiled C<REGEXP>'s C<precomp> and C<prelen>
are equivalent to the new one, and if so uses the old pattern instead
of compiling a new one.

The relevant snippet from C<Perl_pp_regcomp>:

    if (!re || !re->precomp || re->prelen != (I32)len ||
        memNE(re->precomp, t, len))
        /* Compile a new pattern */

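The same test can be written as a standalone function to make the
caching rule easier to experiment with.  This is a sketch only: the
C<ex_compiled> type and C<ex_needs_recompile> are hypothetical names,
and C<memcmp> stands in for the C<memNE> macro used in the snippet
above.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the two regexp fields involved. */
typedef struct {
    const char *precomp;  /* pattern text that was compiled */
    long        prelen;   /* its length in bytes */
} ex_compiled;

/* Nonzero if the pattern text t of len bytes differs from what re was
 * compiled from, i.e. a fresh compile is needed. */
int ex_needs_recompile(const ex_compiled *re, const char *t, long len)
{
    assert(t != NULL && len >= 0);

    return re == NULL || re->precomp == NULL || re->prelen != len
        || memcmp(re->precomp, t, (size_t)len) != 0;
}
```

Comparing the lengths first keeps the byte comparison from running past
the end of the shorter pattern.
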
=head2 C<paren_names>

This is a hash used internally to track named capture groups and their
offsets.  The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
pv being an embedded array of I32.  The values may also be contained
independently in the data array in cases where named backreferences are
used.

=head2 C<substrs>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern.  Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.

=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>

Used during the execution phase for managing search and replace patterns,
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
buffer (either the original string, or a copy in the case of
C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer.  The
C<RX_OFFS> start and end indices index into this buffer.

=for apidoc Amh||RX_MATCH_COPIED|const REGEXP * rx
=for apidoc Amh||RX_OFFS|const REGEXP * rx_sv

In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
can choose not to copy the full buffer (although it must still do so in
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
C<PL_sawampersand>).  In this case, it may set C<suboffset> to indicate the
number of bytes from the logical start of the buffer to the physical start
(i.e. C<subbeg>).  It should also set C<subcoffset>, the number of
characters in the offset. The latter is needed to support C<@-> and C<@+>
which work in characters, not bytes.

=for apidoc Amnh||REXEC_COPY_STR
=for apidoc_item ||REXEC_COPY_SKIP_PRE
=for apidoc_item ||REXEC_COPY_SKIP_POST

=head2 C<wrapped> C<wraplen>

Stores the string C<qr//> stringifies to. The Perl engine for example
stores C<(?^:eek)> in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct
for inline modifiers, it's probably best to have C<qr//> stringify to
the supplied pattern.  Note that this will create undesired patterns in
cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.

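To see why the wrapped form matters, the sketch below builds the
C<(?^...:...)> wrapper in plain C and then concatenates two wrapped
patterns the way interpolation into C<qr/$x$y/> would.  The C<ex_wrap>
helper is hypothetical, and the modifier string is passed in by the
caller rather than derived from C<extflags> as a real engine would do.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the wrapped form "(?^MODS:PATTERN)" in buf.
 * Returns 0 on success, -1 if buf is too small. */
int ex_wrap(char *buf, size_t size, const char *mods, const char *pat)
{
    size_t need;

    assert(buf != NULL && mods != NULL && pat != NULL);

    need = strlen("(?^:)") + strlen(mods) + strlen(pat) + 1;
    if (need > size)
        return -1;
    snprintf(buf, size, "(?^%s:%s)", mods, pat);
    return 0;
}
```

Wrapping C<a|b> and C<c> (with the C<i> modifier) before joining them
gives C<(?^:a|b)(?^i:c)>, which keeps the alternation and the modifier
confined to their own groups, unlike the bare C<a|bc>.
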
=head2 C<seen_evals>

This stores the number of eval groups in the pattern.  This is used for
security purposes when embedding compiled regexes into larger patterns
with C<qr//>.

=head2 C<refcnt>

The number of times the structure is referenced.  When this falls to 0,
the regexp is automatically freed by a call to C<pregfree>.  This should
be set to 1 in each engine's L</comp> routine.

=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut