=head1 NAME

perlreapi - Perl regular expression plugin interface

=head1 DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging in and using
regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the
following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_
                         const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_
                         REGEXP * const rx,
                         char* stringarg,
                         char* strend, char* strbeg,
                         SSize_t minend, SV* sv,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_
                           REGEXP * const rx, SV *sv,
                           const char * const strbeg,
                           char *strpos, char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_
                                        REGEXP * const rx,
                                        const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_
                                         REGEXP * const rx,
                                         const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_
                               REGEXP * const rx,
                               SV * const key,
                               SV * const value,
                               U32 flags);
        SV*     (*named_buff_iter) (pTHX_
                                    REGEXP * const rx,
                                    const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;

=for apidoc_section $regexp
=for apidoc Ay||regexp_engine

When a regexp is compiled, its C<engine> field is then set to point at
the appropriate structure, so that when it needs to be used Perl can find
the right routines to do so.

In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
structures.
When compiling, the C<comp> method is executed, and the
resulting C<regexp> structure's engine field is expected to point back at
the same structure.

The C<pTHX_> symbol in the definition is a macro used by Perl under threading
to provide an extra argument to the routine holding a pointer back to
the interpreter that is executing the regexp. So under threading all
routines get an extra argument.

=head1 Callbacks

=head2 comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in C<pattern> using the given C<flags> and
return a pointer to a prepared C<REGEXP> structure that can perform
the match. See L</The REGEXP structure> below for an explanation of
the individual fields in the REGEXP struct.

The C<pattern> parameter is the scalar that was used as the
pattern. Previous versions of Perl would pass two C<char*> indicating
the start and end of the stringified pattern; the following snippet can
be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern, it's possible to implement
an engine that does something with an array (C<< "ook" =~ [ qw/ eek
hlagh / ] >>) or with the non-stringified form of a compiled regular
expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always
stringify everything using the snippet above, but that doesn't mean
other engines have to.

The C<flags> parameter is a bitfield which indicates which of the
C<msixpn> flags the regex was compiled with. It also contains
additional info, such as whether C<use locale> is in effect.

The C<eogc> flags are stripped out before being passed to the comp
routine. The regex engine does not need to know if any of these
are set, as those flags should only affect what Perl does with the
pattern and its match variables, not how it gets compiled and
executed.
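
As a rough illustration of treating C<flags> as a bitfield, the following
standalone sketch maps modifier bits back to their letters. It does not use
Perl's headers: the C<DEMO_PMf_*> values and the C<demo_flags_to_modifiers>
function are hypothetical stand-ins, and a real engine would test the
C<RXf_PMf_*> constants instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in bit values for this sketch only; the real
 * RXf_PMf_* constants come from Perl's headers and may differ. */
#define DEMO_PMf_MULTILINE  (1U << 0)   /* /m */
#define DEMO_PMf_SINGLELINE (1U << 1)   /* /s */
#define DEMO_PMf_FOLD       (1U << 2)   /* /i */
#define DEMO_PMf_EXTENDED   (1U << 3)   /* /x */

/* Write the modifier letters whose bits are set in `flags` into `out`
 * (NUL-terminated) and return how many letters were written.
 * `outlen` must have room for every letter plus the terminator. */
static size_t
demo_flags_to_modifiers(unsigned int flags, char *out, size_t outlen)
{
    static const struct { unsigned int bit; char letter; } map[] = {
        { DEMO_PMf_MULTILINE,  'm' },
        { DEMO_PMf_SINGLELINE, 's' },
        { DEMO_PMf_FOLD,       'i' },
        { DEMO_PMf_EXTENDED,   'x' },
    };
    const size_t nflags = sizeof map / sizeof map[0];
    size_t n = 0;
    size_t i;

    assert(out != NULL && outlen >= nflags + 1);

    for (i = 0; i < nflags; i++) {
        if (flags & map[i].bit)
            out[n++] = map[i].letter;
    }
    out[n] = '\0';
    return n;
}
```

Calling it with C<DEMO_PMf_FOLD | DEMO_PMf_MULTILINE> fills the buffer with
the two letters in table order. An engine would typically branch on the
individual bits rather than build a string; the string form is used here
only because it is easy to inspect.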

By the time the comp callback is called, some of these flags have
already had effect (noted below where applicable). However most of
their effect occurs after the comp callback has run, in routines that
read the C<< rx->extflags >> field which it populates.

In general the flags should be preserved in C<< rx->extflags >> after
compilation, although the regex engine might want to add or delete
some of them to invoke or disable some special behavior in Perl. The
flags along with any special behavior they cause are documented below:

The pattern modifiers:

=over 4

=item C</m> - RXf_PMf_MULTILINE

If this is in C<< rx->extflags >> it will be passed to
C<Perl_fbm_instr> by C<pp_split> which will treat the subject string
as a multi-line string.

=for apidoc Amnh||RXf_PMf_MULTILINE
=for apidoc_item RXf_PMf_SINGLELINE
=for apidoc_item RXf_PMf_FOLD
=for apidoc_item RXf_PMf_EXTENDED
=for apidoc_item RXf_PMf_KEEPCOPY

=item C</s> - RXf_PMf_SINGLELINE

=item C</i> - RXf_PMf_FOLD

=item C</x> - RXf_PMf_EXTENDED

If present on a regex, C<"#"> comments will be handled differently by the
tokenizer in some cases.

TODO: Document those cases.


=item C</p> - RXf_PMf_KEEPCOPY

TODO: Document this

=item Character set

The character set rules are determined by an enum that is contained
in this field. This is still experimental and subject to change, but
the current interface returns the rules by use of the in-line function
C<get_regex_charset(const U32 flags)>. The only currently documented
value returned from it is REGEX_LOCALE_CHARSET, which is set if
C<use locale> is in effect. If present in C<< rx->extflags >>,
C<split> will use the locale dependent definition of whitespace
when RXf_SKIPWHITE or RXf_WHITE is in effect.
ASCII whitespace
is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal
macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use
locale>.

=for apidoc Amnh||REGEX_LOCALE_CHARSET

=back

Additional flags:

=over 4

=item RXf_SPLIT

This flag was removed in perl 5.18.0. C<split ' '> is now special-cased
solely in the parser. RXf_SPLIT is still #defined, so you can test for it.
This is how it used to work:

If C<split> is invoked as C<split ' '> or with no arguments (which
really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will
set this flag. The regex engine can then check for it and set the
SKIPWHITE and WHITE extflags. To do this, the Perl engine does:

    if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
        r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

=back

These flags can be set during compilation to enable optimizations in
the C<split> operator.

=for apidoc Amnh||RXf_SPLIT
=for apidoc_item RXf_SKIPWHITE
=for apidoc_item RXf_START_ONLY
=for apidoc_item RXf_WHITE
=for apidoc_item RXf_NULL
=for apidoc_item RXf_NO_INPLACE_SUBST

=over 4

=item RXf_SKIPWHITE

This flag was removed in perl 5.18.0. It is still #defined, so you can
set it, but doing so will have no effect. This is how it used to work:

If the flag is present in C<< rx->extflags >> C<split> will delete
whitespace from the start of the subject string before it's operated
on. What is considered whitespace depends on if the subject is a
UTF-8 string and if the C<RXf_PMf_LOCALE> flag is set.

If RXf_WHITE is set in addition to this flag, C<split> will behave like
C<split " "> under the Perl engine.


=item RXf_START_ONLY

Tells the split operator to split the target string on newlines
(C<\n>) without invoking the regex engine.

Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp
== '^'>), even under C</^/s>; see L<split|perlfunc/split>. Of course a
different regex engine might want to use the same optimizations
with a different syntax.

=item RXf_WHITE

Tells the split operator to split the target string on whitespace
without invoking the regex engine. The definition of whitespace varies
depending on whether the target string is a UTF-8 string and on
whether RXf_PMf_LOCALE is set.

Perl's engine sets this flag if the pattern is C<\s+>.

=item RXf_NULL

Tells the split operator to split the target string on
characters. The definition of character varies depending on whether
the target string is a UTF-8 string.

Perl's engine sets this flag on empty patterns; this optimization
makes C<split //> much faster than it would otherwise be. It's even
faster than C<unpack>.

=item RXf_NO_INPLACE_SUBST

Added in perl 5.18.0, this flag indicates that a regular expression might
perform an operation that would interfere with inplace substitution. For
instance it might contain lookbehind, or assign to non-magical variables
(such as $REGMARK and $REGERROR) during matching. C<s///> will skip
certain optimisations when this is set.

=back

=head2 exec

    I32 exec(pTHX_ REGEXP * const rx,
             char *stringarg, char* strend, char* strbeg,
             SSize_t minend, SV* sv,
             void* data, U32 flags);

Execute a regexp. The arguments are

=over 4

=item rx

The regular expression to execute.

=item sv

This is the SV to be matched against. Note that the
actual char array to be matched against is supplied by the arguments
described below; the SV is just used to determine UTF8ness, C<pos()> etc.

=item strbeg

Pointer to the physical start of the string.

=item strend

Pointer to the character following the physical end of the string (i.e.
the C<\0>, if any).

=item stringarg

Pointer to the position in the string where matching should start; it might
not be equal to C<strbeg> (for example in a later iteration of C</.../g>).

=item minend

Minimum length of string (measured in bytes from C<stringarg>) that must
match; if the engine reaches the end of the match but hasn't reached this
position in the string, it should fail.

=item data

Optimisation data; subject to change.

=item flags

Optimisation flags; subject to change.

=back

=head2 intuit

    char* intuit(pTHX_
                 REGEXP * const rx,
                 SV *sv,
                 const char * const strbeg,
                 char *strpos,
                 char *strend,
                 const U32 flags,
                 struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted,
or determine that the regex engine should not be run at all because the
pattern can't match. This is called, as appropriate, by the core,
depending on the values of the C<extflags> member of the C<regexp>
structure.

Arguments:

    rx:     the regex to match against
    sv:     the SV being matched: only used for utf8 flag; the string
            itself is accessed via the pointers below. Note that on
            something like an overloaded SV, SvPOK(sv) may be false
            and the string pointers may point to something unrelated to
            the SV itself.
    strbeg: real beginning of string
    strpos: the point in the string at which to begin matching
    strend: pointer to the byte following the last char of the string
    flags:  currently unused; set to 0
    data:   currently unused; set to NULL


=head2 checkstr

    SV* checkstr(pTHX_ REGEXP * const rx);

Return an SV containing a string that must appear in the pattern. Used
by C<split> for optimising matches.
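
The relationship between C<strbeg>, C<stringarg>, C<strend> and C<minend>
in the C<exec> callback can be sketched with a toy matcher. The code below
is standalone and only an approximation: C<demo_exec> is a hypothetical
function that searches for a literal string instead of running a compiled
pattern, and it takes the literal where a real callback would take C<rx>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an exec callback: looks for the literal `needle`
 * between stringarg and strend. Returns 1 on success, 0 on failure.
 * All names here are hypothetical. */
static int
demo_exec(const char *needle,
          const char *stringarg, const char *strend, const char *strbeg,
          ptrdiff_t minend)
{
    const size_t nlen = strlen(needle);
    size_t avail;
    size_t i;

    /* stringarg may lie anywhere inside [strbeg, strend]. */
    assert(strbeg <= stringarg && stringarg <= strend);
    avail = (size_t)(strend - stringarg);

    for (i = 0; i + nlen <= avail; i++) {
        if (memcmp(stringarg + i, needle, nlen) == 0) {
            /* Honour minend: a match that ends fewer than minend bytes
             * after stringarg is rejected, so keep scanning. */
            if ((ptrdiff_t)(i + nlen) >= minend)
                return 1;
        }
    }
    return 0;
}
```

With the subject C<"xxabab">, C<stringarg> two bytes past C<strbeg> and the
needle C<"ab">, a C<minend> of 0 accepts the first occurrence, a C<minend>
of 3 skips it and accepts the second, and a C<minend> of 5 fails because no
occurrence ends that far into the string.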

=head2 free

    void free(pTHX_ REGEXP * const rx);

Called by Perl when it is freeing a regexp pattern so that the engine
can release any resources pointed to by the C<pprivate> member of the
C<regexp> structure. This is only responsible for freeing private data;
Perl will handle releasing anything else contained in the C<regexp> structure.

=head2 Numbered capture callbacks

Called to get/set the value of C<$`>, C<$'>, C<$&> and their named
equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the
numbered capture groups (C<$1>, C<$2>, ...).

The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so
forth, and have these symbolic values for the special variables:

    ${^PREMATCH}  RX_BUFF_IDX_CARET_PREMATCH
    ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH
    ${^MATCH}     RX_BUFF_IDX_CARET_FULLMATCH
    $`            RX_BUFF_IDX_PREMATCH
    $'            RX_BUFF_IDX_POSTMATCH
    $&            RX_BUFF_IDX_FULLMATCH

=for apidoc Amnh||RX_BUFF_IDX_CARET_FULLMATCH
=for apidoc_item RX_BUFF_IDX_CARET_POSTMATCH
=for apidoc_item RX_BUFF_IDX_CARET_PREMATCH
=for apidoc_item RX_BUFF_IDX_FULLMATCH
=for apidoc_item RX_BUFF_IDX_POSTMATCH
=for apidoc_item RX_BUFF_IDX_PREMATCH

Note that in Perl 5.17.3 and earlier, the last three constants were also
used for the caret variants of the variables.

The names have been chosen by analogy with L<Tie::Scalar> method
names, with an additional B<LENGTH> callback for efficiency. However
named capture variables are currently not tied internally but
implemented via magic.

=head3 numbered_buff_FETCH

    void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
                             SV * const sv);

Fetch a specified numbered capture.
C<sv> should be set to the scalar
to return. The scalar is passed as an argument rather than being
returned from the function because when it's called Perl already has a
scalar to store the value; creating another one would be
redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and
friends, see L<perlapi>.

This callback is where Perl untaints its own capture variables under
taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch>
function in F<regcomp.c> for how to untaint capture variables if
that's something you'd like your engine to do as well.

=head3 numbered_buff_STORE

    void    (*numbered_buff_STORE) (pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value);

Set the value of a numbered capture variable. C<value> is the scalar
that is to be used as the new value. It's up to the engine to make
sure this is used as the new value (or reject it).

Example:

    if ("ook" =~ /(o*)/) {
        # 'paren' will be '1' and 'value' will be 'ee'
        $1 =~ tr/o/e/;
    }

Perl's own engine will croak on any attempt to modify the capture
variables. To do this in another engine use the following callback
(copied from C<Perl_reg_numbered_buff_store>):

    void
    Example_reg_numbered_buff_store(pTHX_
                                    REGEXP * const rx,
                                    const I32 paren,
                                    SV const * const value)
    {
        PERL_UNUSED_ARG(rx);
        PERL_UNUSED_ARG(paren);
        PERL_UNUSED_ARG(value);

        if (!PL_localizing)
            Perl_croak(aTHX_ PL_no_modify);
    }

Actually Perl will not I<always> croak in a statement that looks
like it would modify a numbered capture variable. This is because the
STORE callback will not be called if Perl can determine that it
doesn't have to modify the value.
This is exactly how tied variables
behave in the same situation:

    package CaptureVar;
    use parent 'Tie::Scalar';

    sub TIESCALAR { bless [] }
    sub FETCH { undef }
    sub STORE { die "This doesn't get called" }

    package main;

    tie my $sv => "CaptureVar";
    $sv =~ y/a/b/;

Because C<$sv> is C<undef> when the C<y///> operator is applied to it,
the transliteration won't actually execute and the program won't
C<die>. This is different to how 5.8 and earlier versions behaved
since the capture variables were READONLY variables then; now they'll
just die when assigned to in the default engine.

=head3 numbered_buff_LENGTH

    I32 numbered_buff_LENGTH (pTHX_
                              REGEXP * const rx,
                              const SV * const sv,
                              const I32 paren);

Get the C<length> of a capture variable. There's a special callback
for this so that Perl doesn't have to do a FETCH and run C<length> on
the result. Since the length is (in Perl's case) known from the offsets
stored in C<< rx->offs >>, this is much more efficient:

    I32 s1  = rx->offs[paren].start;
    I32 t1  = rx->offs[paren].end;
    I32 len = t1 - s1;

This is a little bit more complex in the case of UTF-8; see what
C<Perl_reg_numbered_buff_length> does with
L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>.

=head2 Named capture callbacks

Called to get/set the value of C<%+> and C<%->, as well as by some
utility functions in L<re>.

There are two callbacks: C<named_buff> is called in all the cases the
FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks
would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the
same cases as FIRSTKEY and NEXTKEY.

The C<flags> parameter can be used to determine which of these
operations the callbacks should respond to.
The following flags are
currently defined:

Which L<Tie::Hash> operation is being performed from the Perl level on
C<%+> or C<%->, if any:

    RXapif_FETCH
    RXapif_STORE
    RXapif_DELETE
    RXapif_CLEAR
    RXapif_EXISTS
    RXapif_SCALAR
    RXapif_FIRSTKEY
    RXapif_NEXTKEY

=for apidoc Amnh ||RXapif_CLEAR
=for apidoc_item RXapif_DELETE
=for apidoc_item RXapif_EXISTS
=for apidoc_item RXapif_FETCH
=for apidoc_item RXapif_FIRSTKEY
=for apidoc_item RXapif_NEXTKEY
=for apidoc_item RXapif_SCALAR
=for apidoc_item RXapif_STORE
=for apidoc_item RXapif_ALL
=for apidoc_item RXapif_ONE
=for apidoc_item RXapif_REGNAME
=for apidoc_item RXapif_REGNAMES
=for apidoc_item RXapif_REGNAMES_COUNT

Whether C<%+> or C<%-> is being operated on, if any:

    RXapif_ONE /* %+ */
    RXapif_ALL /* %- */

Whether this is being called as C<re::regname>, C<re::regnames> or
C<re::regnames_count>, if any. The first two will be combined with
C<RXapif_ONE> or C<RXapif_ALL>.

    RXapif_REGNAME
    RXapif_REGNAMES
    RXapif_REGNAMES_COUNT


Internally C<%+> and C<%-> are implemented with a real tied interface
via L<Tie::Hash::NamedCapture>. The methods in that package will call
back into these functions. However the usage of
L<Tie::Hash::NamedCapture> for this purpose might change in future
releases. For instance this might be implemented by magic instead
(would need an extension to mgvtbl).

=head3 named_buff

    SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                           SV * const value, U32 flags);

=head3 named_buff_iter

    SV*     (*named_buff_iter) (pTHX_
                                REGEXP * const rx,
                                const SV * const lastkey,
                                const U32 flags);

=head2 qr_package

    SV* qr_package(pTHX_ REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by C<ref
qr//>).
It is recommended that engines change this to their package
name for identification regardless of whether they implement methods
on the object.

The package this method returns should also have the internal
C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always
be true regardless of what engine is being used.

An example implementation might be:

    SV*
    Example_qr_package(pTHX_ REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with C<qr//> will be dispatched to the
package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the C<REGEXP> object from the scalar in an XS function use
the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP
Functions>.

    void meth(SV * rv)
    PPCODE:
        REGEXP * re = SvRX(rv);

=head2 dupe

    void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads. This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the C<regexp> structure. It will be called with the preconstructed new
C<regexp> structure as an argument, the C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which Perl will then use to
overwrite the field as passed to this routine.)

This allows the engine to dupe its private data but also if necessary
modify the final structure if it really must.

On unthreaded builds this field doesn't exist.

=head2 op_comp

This is private to the Perl core and subject to change. Should be left
null.
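
The contract of the C<dupe> callback described above (receive the new
structure whose C<pprivate> still points at the old private data, return a
fresh copy) can be sketched in standalone C. Everything here is
hypothetical: C<demo_regexp> stands in for the one field of C<struct regexp>
that matters, C<demo_private> is an invented private layout, and the
C<pTHX_> and C<CLONE_PARAMS> arguments are omitted so that the sketch
compiles without Perl's headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine-private data: one heap-allocated program string. */
typedef struct demo_private {
    char  *program;
    size_t proglen;
} demo_private;

/* Stand-in for the single field of struct regexp this sketch touches. */
typedef struct demo_regexp {
    void *pprivate;
} demo_regexp;

/* Modelled on dupe: rx is the new structure, but rx->pprivate still
 * points at the OLD private data. Build a deep copy and return it; the
 * caller stores the result. Returns NULL if allocation fails. */
static void *
demo_dupe(demo_regexp * const rx)
{
    const demo_private *old = (const demo_private *)rx->pprivate;
    demo_private *copy;

    assert(old != NULL && old->program != NULL);

    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;

    copy->program = malloc(old->proglen + 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    /* Copy the program bytes plus the trailing NUL. */
    memcpy(copy->program, old->program, old->proglen + 1);
    copy->proglen = old->proglen;
    return copy;
}

/* Counterpart modelled on free: release only the private data. */
static void
demo_free_private(demo_private *priv)
{
    if (priv != NULL) {
        free(priv->program);
        free(priv);
    }
}
```

The copy is deep on purpose: if both threads shared one C<program> buffer,
the C<free> callback of one interpreter would release memory still in use
by the other.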

=head1 The REGEXP structure

The REGEXP struct is defined in F<regexp.h>.
All regex engines must be able to
correctly build such a structure in their L</comp> routine.

=for apidoc Ayh||REGEXP

The REGEXP structure contains all the data that Perl needs to be aware of
to properly work with the regular expression. It includes data about
optimisations that Perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as if the pattern is anchored in
some way, or what flags were used during the compile, or if the
program contains special constructs that Perl needs to be aware of.

In addition it contains two fields that are intended for the private
use of the regex engine that compiled the pattern. These are the
C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to
an arbitrary structure, whose use and management is the responsibility
of the compiling engine. Perl will never modify either of these
values.

    typedef struct regexp {
        /* what engine created this regexp? */
        const struct regexp_engine* engine;

        /* what re is this a lightweight copy of?
         */
        struct regexp* mother_re;

        /* Information about the match that the Perl core uses to manage
         * things */
        U32 extflags;      /* Flags used both externally and internally */
        I32 minlen;        /* minimum possible number of chars in
                              string to match */
        I32 minlenret;     /* minimum possible number of chars in $& */
        U32 gofs;          /* chars left of pos that we search from */

        /* substring data about strings that must appear
           in the final match, used for optimisations */
        struct reg_substr_data *substrs;

        U32 nparens;       /* number of capture groups */

        /* private engine specific data */
        U32 intflags;      /* Engine Specific Internal flags */
        void *pprivate;    /* Data private to the regex engine which
                              created this object. */

        /* Data about the last/current match. These are modified during
         * matching */
        U32 lastparen;           /* highest close paren matched ($+) */
        U32 lastcloseparen;      /* last close paren matched ($^N) */
        regexp_paren_pair *offs; /* Array of offsets for (@-) and
                                    (@+) */

        char *subbeg;   /* saved or original string so \digit works
                           forever.
                           */
        SV_SAVED_COPY   /* If non-NULL, SV which is COW from original */
        I32 sublen;     /* Length of string pointed by subbeg */
        I32 suboffset;  /* byte offset of subbeg from logical start of
                           str */
        I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */

        /* Information about the match that isn't often used */
        I32 prelen;           /* length of precomp */
        const char *precomp;  /* pre-compilation regular expression */

        char *wrapped;  /* wrapped version of the pattern */
        I32 wraplen;    /* length of wrapped */

        I32 seen_evals;  /* number of eval groups in the pattern - for
                            security checks */
        HV *paren_names; /* Optional hash of paren names */

        /* Refcount of this regexp */
        I32 refcnt;     /* Refcount of this regexp */
    } regexp;

The fields are discussed in more detail below:

=head2 C<engine>

This field points at a C<regexp_engine> structure which contains pointers
to the subroutines that are to be used for performing a match. It
is the compiling routine's responsibility to populate this field before
returning the regexp object.

Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>. Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.

=for apidoc Amnh||SV_SAVED_COPY

=head2 C<mother_re>

TODO, see commit 28d8d7f41a.

=head2 C<extflags>

This will be used by Perl to see what flags the regexp was compiled
with. It will normally be set to the value of the flags parameter by
the L<comp|/comp> callback. See the L<comp|/comp> documentation for
valid flags.

=head2 C<minlen> C<minlenret>

The minimum string length (in characters) required for the pattern to match.
This is used to
prune the search space by not bothering to match any closer to the end of a
string than would allow a match.
For instance there is no point in even
starting the regex engine if the minlen is 10 but the string is only 5
characters long. There is no way that the pattern can match.

C<minlenret> is the minimum length (in characters) of the string that would
be found in C<$&> after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2 as the C<\d>
is required to match but is not actually included in the matched content.
This distinction is particularly important as the substitution logic uses
the C<minlenret> to tell if it can do in-place substitutions (these can
result in considerable speed-up).

=head2 C<gofs>

Left offset from pos() to start match at.

=head2 C<substrs>

Substring data about strings that must appear in the final match. This
is currently only used internally by Perl's engine, but might be
used in the future for all engines for optimisations.

=head2 C<nparens>, C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of: how many paren capture groups
there are in the pattern; which was the highest paren to be closed (see
L<perlvar/$+>); and which was the most recent paren to be closed (see
L<perlvar/$^N>).

=head2 C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=head2 C<pprivate>

A void* pointing to an engine-defined
data structure. The Perl engine uses the
C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom
engine should use something else.

=head2 C<offs>

An array of C<regexp_paren_pair> structures which define offsets into the
string being matched which correspond to the C<$&> and C<$1>, C<$2> etc.
captures; the
C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

=for apidoc Ayh||regexp_paren_pair

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture group did not match.
C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.

=head2 C<precomp> C<prelen>

Used for optimisations. C<precomp> holds a copy of the pattern that
was compiled and C<prelen> its length. When a new pattern is to be
compiled (such as inside a loop) the internal C<regcomp> operator
checks if the last compiled C<REGEXP>'s C<precomp> and C<prelen>
are equivalent to the new one, and if so uses the old pattern instead
of compiling a new one.

The relevant snippet from C<Perl_pp_regcomp>:

    if (!re || !re->precomp || re->prelen != (I32)len ||
        memNE(re->precomp, t, len))
        /* Compile a new pattern */

=head2 C<paren_names>

This is a hash used internally to track named capture groups and their
offsets. The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
pv being an embedded array of I32. The values may also be contained
independently in the data array in cases where named backreferences are
used.

=head2 C<substrs>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.
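
The idea behind these pre-checks, skipping the engine when a required
substring is absent or the subject is too short, can be sketched in
standalone C. This is only an illustration under simplifying assumptions:
it uses a naive byte scan rather than the Fast-Boyer-Moore search mentioned
above, it treats C<minlen> as a byte count (that is, single-byte
characters), and C<demo_contains> and C<demo_worth_running> are hypothetical
names, not part of Perl's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the byte string `must` (length mlen) occurs anywhere in
 * the first slen bytes of `subject`, else 0. A naive scan, standing in
 * for the faster search the core uses. */
static int
demo_contains(const char *subject, size_t slen,
              const char *must, size_t mlen)
{
    size_t i;

    assert(subject != NULL && must != NULL);

    if (mlen == 0)
        return 1;               /* an empty requirement always holds */
    for (i = 0; i + mlen <= slen; i++) {
        if (memcmp(subject + i, must, mlen) == 0)
            return 1;
    }
    return 0;
}

/* Decide whether running the full engine is worthwhile: only when the
 * subject is at least minlen bytes long and contains the substring
 * that every match must include. */
static int
demo_worth_running(const char *subject, size_t slen,
                   size_t minlen, const char *must, size_t mlen)
{
    if (slen < minlen)
        return 0;
    return demo_contains(subject, slen, must, mlen);
}
```

For a pattern that must contain C<bar> and has a minimum length of 3, the
subject C<foobar> passes the check while C<foobaz> and the two-byte subject
C<ba> are rejected without the engine being started.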

=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>

Used during the execution phase for managing search and replace patterns,
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
buffer (either the original string, or a copy in the case of
C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer. The
C<RX_OFFS> start and end indices index into this buffer.

=for apidoc Amh||RX_MATCH_COPIED|const REGEXP * rx
=for apidoc Amh||RX_OFFS|const REGEXP * rx_sv

In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
can choose not to copy the full buffer (although it must still do so in
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
number of bytes from the logical start of the buffer to the physical start
(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
characters in the offset. The latter is needed to support C<@-> and C<@+>
which work in characters, not bytes.

=for apidoc Amnh||REXEC_COPY_STR
=for apidoc_item ||REXEC_COPY_SKIP_PRE
=for apidoc_item ||REXEC_COPY_SKIP_POST

=head2 C<wrapped> C<wraplen>

Stores the string C<qr//> stringifies to. The Perl engine for example
stores C<(?^:eek)> in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct
for inline modifiers, it's probably best to have C<qr//> stringify to
the supplied pattern. Note that this will create undesired patterns in
cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.
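
To make the benefit of a wrapped form concrete, here is a standalone sketch
of how a C<wrapped> string in the C<(?^...:...)> style shown above could be
assembled. C<demo_wrap> is a hypothetical helper, not what Perl's engine
actually calls, and the modifier letters are passed in as a plain string for
simplicity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "(?^MODS:PATTERN)" into `out`. Returns the length of the
 * wrapped string, or -1 if it does not fit in `outlen` bytes. */
static int
demo_wrap(char *out, size_t outlen, const char *pattern, const char *mods)
{
    int n;

    assert(out != NULL && pattern != NULL && mods != NULL);

    n = snprintf(out, outlen, "(?^%s:%s)", mods, pattern);
    if (n < 0 || (size_t)n >= outlen)
        return -1;              /* encoding error or truncated output */
    return n;
}
```

Wrapping C<a|b> with no modifiers gives C<(?^:a|b)> and wrapping C<c> with
C<i> gives C<(?^i:c)>. Concatenating the two yields
C<(?^:a|b)(?^i:c)>, which keeps the alternation and the case-insensitivity
confined to their own groups, unlike the C<a|bc> result of the unwrapped
strings.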

=head2 C<seen_evals>

This stores the number of eval groups in the pattern. This is used for
security purposes when embedding compiled regexes into larger patterns
with C<qr//>.

=head2 C<refcnt>

The number of times the structure is referenced. When
this falls to 0, the regexp is automatically freed
by a call to C<pregfree>. This should be set to 1 in
each engine's L</comp> routine.

=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut