1.\" $NetBSD: regex.3,v 1.26 2018/02/26 14:04:43 kamil Exp $ 2.\" 3.\" Copyright (c) 1992, 1993, 1994 4.\" The Regents of the University of California. All rights reserved. 5.\" 6.\" This code is derived from software contributed to Berkeley by 7.\" Henry Spencer. 8.\" 9.\" Redistribution and use in source and binary forms, with or without 10.\" modification, are permitted provided that the following conditions 11.\" are met: 12.\" 1. Redistributions of source code must retain the above copyright 13.\" notice, this list of conditions and the following disclaimer. 14.\" 2. Redistributions in binary form must reproduce the above copyright 15.\" notice, this list of conditions and the following disclaimer in the 16.\" documentation and/or other materials provided with the distribution. 17.\" 3. Neither the name of the University nor the names of its contributors 18.\" may be used to endorse or promote products derived from this software 19.\" without specific prior written permission. 20.\" 21.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 22.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 23.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 24.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 25.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 26.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 27.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 28.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 29.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 30.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 31.\" SUCH DAMAGE. 32.\" 33.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 34.\" 35.\" This code is derived from software contributed to Berkeley by 36.\" Henry Spencer. 37.\" 38.\" Redistribution and use in source and binary forms, with or without 39.\" modification, are permitted provided that the following conditions 40.\" are met: 41.\" 1. Redistributions of source code must retain the above copyright 42.\" notice, this list of conditions and the following disclaimer. 43.\" 2. Redistributions in binary form must reproduce the above copyright 44.\" notice, this list of conditions and the following disclaimer in the 45.\" documentation and/or other materials provided with the distribution. 46.\" 3. All advertising materials mentioning features or use of this software 47.\" must display the following acknowledgement: 48.\" This product includes software developed by the University of 49.\" California, Berkeley and its contributors. 50.\" 4. Neither the name of the University nor the names of its contributors 51.\" may be used to endorse or promote products derived from this software 52.\" without specific prior written permission. 53.\" 54.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 55.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 56.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 57.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 58.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 59.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 60.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 61.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 62.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 63.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 64.\" SUCH DAMAGE. 65.\" 66.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 67.\" 68.Dd February 26, 2018 69.Dt REGEX 3 70.Os 71.Sh NAME 72.Nm regex , 73.Nm regcomp , 74.Nm regexec , 75.Nm regerror , 76.Nm regfree , 77.Nm regasub , 78.Nm regnsub 79.Nd regular-expression library 80.Sh LIBRARY 81.Lb libc 82.Sh SYNOPSIS 83.In regex.h 84.Ft int 85.Fn regcomp "regex_t * restrict preg" "const char * restrict pattern" "int cflags" 86.Ft int 87.Fn regexec "const regex_t * restrict preg" "const char * restrict string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 88.Ft size_t 89.Fn regerror "int errcode" "const regex_t * restrict preg" "char * restrict errbuf" "size_t errbuf_size" 90.Ft void 91.Fn regfree "regex_t *preg" 92.Ft ssize_t 93.Fn regnsub "char *buf" "size_t bufsiz" "const char *sub" "const regmatch_t *rm" "const char *str" 94.Ft ssize_t 95.Fn regasub "char **buf" "const char *sub" "const regmatch_t *rm" "const char *sstr" 96.Sh DESCRIPTION 97These routines implement 98.St -p1003.2-92 99regular expressions (``RE''s); 100see 101.Xr re_format 7 . 102.Fn regcomp 103compiles an RE written as a string into an internal form, 104.Fn regexec 105matches that internal form against a string and reports results, 106.Fn regerror 107transforms error codes from either into human-readable messages, 108and 109.Fn regfree 110frees any dynamically-allocated storage used by the internal form 111of an RE. 112.Pp 113The header 114.In regex.h 115declares two structure types, 116.Fa regex_t 117and 118.Fa regmatch_t , 119the former for compiled internal forms and the latter for match reporting. 120It also declares the four functions, 121a type 122.Fa regoff_t , 123and a number of constants with names starting with ``REG_''. 124.Pp 125.Fn regcomp 126compiles the regular expression contained in the 127.Fa pattern 128string, 129subject to the flags in 130.Fa cflags , 131and places the results in the 132.Fa regex_t 133structure pointed to by 134.Fa preg . 135.Fa cflags 136is the bitwise OR of zero or more of the following flags: 137.Bl -tag -width XXXREG_EXTENDED 138.It Dv REG_EXTENDED 139Compile modern (``extended'') REs, rather than the obsolete 140(``basic'') REs that are the default. 141.It Dv REG_BASIC 142This is a synonym for 0, 143provided as a counterpart to REG_EXTENDED to improve readability. 144.It Dv REG_NOSPEC 145Compile with recognition of all special characters turned off. 146All characters are thus considered ordinary, so the ``RE'' is a literal 147string. 148This is an extension, compatible with but not specified by 149.St -p1003.2-92 , 150and should be used with caution in software intended to be portable to 151other systems. 152.Dv REG_EXTENDED 153and 154.Dv REG_NOSPEC 155may not be used in the same call to 156.Fn regcomp . 157.It Dv REG_ICASE 158Compile for matching that ignores upper/lower case distinctions. 159See 160.Xr re_format 7 . 161.It Dv REG_NOSUB 162Compile for matching that need only report success or failure, not 163what was matched. 164.It Dv REG_NEWLINE 165Compile for newline-sensitive matching. 166By default, newline is a completely ordinary character with no special 167meaning in either REs or strings. 168With this flag, 169`[^' bracket expressions and `.' never match newline, 170a `^' anchor matches the null string after any newline in the string 171in addition to its normal function, 172and the `$' anchor matches the null string before any newline in the 173string in addition to its normal function. 174.It Dv REG_PEND 175The regular expression ends, not at the first NUL, but just before the 176character pointed to by the 177.Fa re_endp 178member of the structure pointed to by 179.Fa preg . 180The 181.Fa re_endp 182member is of type 183.Fa "const\ char\ *" . 184This flag permits inclusion of NULs in the RE; they are considered 185ordinary characters. 186This is an extension, compatible with but not specified by 187.St -p1003.2-92 , 188and should be used with caution in software intended to be portable to 189other systems. 190.El 191.Pp 192When successful, 193.Fn regcomp 194returns 0 and fills in the structure pointed to by 195.Fa preg . 196One member of that structure (other than 197.Fa re_endp ) 198is publicized: 199.Fa re_nsub , 200of type 201.Fa size_t , 202contains the number of parenthesized subexpressions within the RE 203(except that the value of this member is undefined if the 204.Dv REG_NOSUB 205flag was used). 206If 207.Fn regcomp 208fails, it returns a non-zero error code; 209see 210.Sx DIAGNOSTICS . 211.Pp 212.Fn regexec 213matches the compiled RE pointed to by 214.Fa preg 215against the 216.Fa string , 217subject to the flags in 218.Fa eflags , 219and reports results using 220.Fa nmatch , 221.Fa pmatch , 222and the returned value. 223The RE must have been compiled by a previous invocation of 224.Fn regcomp . 225The compiled form is not altered during execution of 226.Fn regexec , 227so a single compiled RE can be used simultaneously by multiple threads. 228.Pp 229By default, 230the NUL-terminated string pointed to by 231.Fa string 232is considered to be the text of an entire line, minus any terminating 233newline. 234The 235.Fa eflags 236argument is the bitwise OR of zero or more of the following flags: 237.Bl -tag -width XXXREG_NOTBOL 238.It Dv REG_NOTBOL 239The first character of the string 240is not the beginning of a line, so the `^' anchor should not match before it. 241This does not affect the behavior of newlines under 242.Dv REG_NEWLINE . 243.It Dv REG_NOTEOL 244The NUL terminating the string does not end a line, so the `$' anchor 245should not match before it. 246This does not affect the behavior of newlines under 247.Dv REG_NEWLINE . 248.It Dv REG_STARTEND 249The string is considered to start at 250.Fa string 251+ 252.Fa pmatch[0].rm_so 253and to have a terminating NUL located at 254.Fa string 255+ 256.Fa pmatch[0].rm_eo 257(there need not actually be a NUL at that location), 258regardless of the value of 259.Fa nmatch . 260See below for the definition of 261.Fa pmatch 262and 263.Fa nmatch . 264This is an extension, compatible with but not specified by 265.St -p1003.2-92 , 266and should be used with caution in software intended to be portable to 267other systems. 268Note that a non-zero 269.Fa rm_so 270does not imply 271.Dv REG_NOTBOL ; 272.Dv REG_STARTEND 273affects only the location of the string, not how it is matched. 274.El 275.Pp 276See 277.Xr re_format 7 278for a discussion of what is matched in situations where an RE or a 279portion thereof could match any of several substrings of 280.Fa string . 281.Pp 282Normally, 283.Fn regexec 284returns 0 for success and the non-zero code 285.Dv REG_NOMATCH 286for failure. 287Other non-zero error codes may be returned in exceptional situations; 288see 289.Sx DIAGNOSTICS . 290.Pp 291If 292.Dv REG_NOSUB 293was specified in the compilation of the RE, or if 294.Fa nmatch 295is 0, 296.Fn regexec 297ignores the 298.Fa pmatch 299argument (but see below for the case where 300.Dv REG_STARTEND 301is specified). 302Otherwise, 303.Fa pmatch 304points to an array of 305.Fa nmatch 306structures of type 307.Fa regmatch_t . 308Such a structure has at least the members 309.Fa rm_so 310and 311.Fa rm_eo , 312both of type 313.Fa regoff_t 314(a signed arithmetic type at least as large as an 315.Fa off_t 316and a 317.Fa ssize_t ) , 318containing respectively the offset of the first character of a substring 319and the offset of the first character after the end of the substring. 320Offsets are measured from the beginning of the 321.Fa string 322argument given to 323.Fn regexec . 324An empty substring is denoted by equal offsets, 325both indicating the character following the empty substring. 326.Pp 327The 0th member of the 328.Fa pmatch 329array is filled in to indicate what substring of 330.Fa string 331was matched by the entire RE. 332Remaining members report what substring was matched by parenthesized 333subexpressions within the RE; 334member 335.Fa i 336reports subexpression 337.Fa i , 338with subexpressions counted (starting at 1) by the order of their 339opening parentheses in the RE, left to right. 340Unused entries in the array\(emcorresponding either to subexpressions that 341did not participate in the match at all, or to subexpressions that do not 342exist in the RE (that is, 343.Fa i 344> 345.Fa preg->re_nsub ) 346\(emhave both 347.Fa rm_so 348and 349.Fa rm_eo 350set to -1. 351If a subexpression participated in the match several times, 352the reported substring is the last one it matched. 353(Note, as an example in particular, that when the RE `(b*)+' matches `bbb', 354the parenthesized subexpression matches each of the three `b's and then 355an infinite number of empty strings following the last `b', 356so the reported substring is one of the empties.) 357.Pp 358If 359.Dv REG_STARTEND 360is specified, 361.Fa pmatch 362must point to at least one 363.Fa regmatch_t 364(even if 365.Fa nmatch 366is 0 or 367.Dv REG_NOSUB 368was specified), 369to hold the input offsets for 370.Dv REG_STARTEND . 371Use for output is still entirely controlled by 372.Fa nmatch ; 373if 374.Fa nmatch 375is 0 or 376.Dv REG_NOSUB 377was specified, 378the value of 379.Fa pmatch [0] 380will not be changed by a successful 381.Fn regexec . 382.Pp 383.Fn regerror 384maps a non-zero 385.Fa errcode 386from either 387.Fn regcomp 388or 389.Fn regexec 390to a human-readable, printable message. 391If 392.Fa preg 393is non-NULL, 394the error code should have arisen from use of the 395.Fa regex_t 396pointed to by 397.Fa preg , 398and if the error code came from 399.Fn regcomp , 400it should have been the result from the most recent 401.Fn regcomp 402using that 403.Fa regex_t . 404.Po Fn regerror 405may be able to supply a more detailed message using information 406from the 407.Fa regex_t . Pc 408.Fn regerror 409places the NUL-terminated message into the buffer pointed to by 410.Fa errbuf , 411limiting the length (including the NUL) to at most 412.Fa errbuf_size 413bytes. 414If the whole message won't fit, 415as much of it as will fit before the terminating NUL is supplied. 416In any case, 417the returned value is the size of buffer needed to hold the whole 418message (including terminating NUL). 419If 420.Fa errbuf_size 421is 0, 422.Fa errbuf 423is ignored but the return value is still correct. 424.Pp 425If the 426.Fa errcode 427given to 428.Fn regerror 429is first ORed with 430.Dv REG_ITOA , 431the ``message'' that results is the printable name of the error code, 432e.g. ``REG_NOMATCH'', 433rather than an explanation thereof. 434If 435.Fa errcode 436is 437.Dv REG_ATOI , 438then 439.Fa preg 440shall be non-NULL and the 441.Fa re_endp 442member of the structure it points to 443must point to the printable name of an error code; 444in this case, the result in 445.Fa errbuf 446is the decimal digits of 447the numeric value of the error code 448(0 if the name is not recognized). 449.Dv REG_ITOA 450and 451.Dv REG_ATOI 452are intended primarily as debugging facilities; 453they are extensions, compatible with but not specified by 454.St -p1003.2-92 , 455and should be used with caution in software intended to be portable to 456other systems. 457Be warned also that they are considered experimental and changes are possible. 458.Pp 459.Fn regfree 460frees any dynamically-allocated storage associated with the compiled RE 461pointed to by 462.Fa preg . 463The remaining 464.Fa regex_t 465is no longer a valid compiled RE 466and the effect of supplying it to 467.Fn regexec 468or 469.Fn regerror 470is undefined. 471.Pp 472None of these functions references global variables except for tables 473of constants; 474all are safe for use from multiple threads if the arguments are safe. 475.Pp 476The 477.Fn regnsub 478and 479.Fn regasub 480functions perform substitutions using 481.Xr sed 1 482like syntax. 483They return the length of the string that would have been created 484if there was enough space or 485.Dv \-1 486on error, setting 487.Dv errno . 488The result 489is being placed in 490.Fa buf 491which is user-supplied in 492.Fn regnsub 493and dynamically allocated in 494.Fn regasub . 495The 496.Fa sub 497argument contains a substitution string which might refer to the first 4989 regular expression strings using 499.Dq \e<n> 500to refer to the nth matched 501item, or 502.Dq & 503(which is equivalent to 504.Dq \e0 ) 505to refer to the full match. 506The 507.Fa rm 508array must be at least 10 elements long, and should contain the result 509of the matches from a previous 510.Fn regexec 511call. 512Only 10 elements of the 513.Fa rm 514array can be used. 515The 516.Fa str 517argument contains the source string to apply the transformation to. 518.Sh IMPLEMENTATION CHOICES 519There are a number of decisions that 520.St -p1003.2-92 521leaves up to the implementor, 522either by explicitly saying ``undefined'' or by virtue of them being 523forbidden by the RE grammar. 524This implementation treats them as follows. 525.Pp 526See 527.Xr re_format 7 528for a discussion of the definition of case-independent matching. 529.Pp 530There is no particular limit on the length of REs, 531except insofar as memory is limited. 532Memory usage is approximately linear in RE size, and largely insensitive 533to RE complexity, except for bounded repetitions. 534See BUGS for one short RE using them 535that will run almost any system out of memory. 536.Pp 537A backslashed character other than one specifically given a magic meaning 538by 539.St -p1003.2-92 540(such magic meanings occur only in obsolete [``basic''] REs) 541is taken as an ordinary character. 542.Pp 543Any unmatched [ is a 544.Dv REG_EBRACK 545error. 546.Pp 547Equivalence classes cannot begin or end bracket-expression ranges. 548The endpoint of one range cannot begin another. 549.Pp 550.Dv RE_DUP_MAX , 551the limit on repetition counts in bounded repetitions, is 255. 552.Pp 553A repetition operator (?, *, +, or bounds) cannot follow another 554repetition operator. 555A repetition operator cannot begin an expression or subexpression 556or follow `^' or `|'. 557.Pp 558`|' cannot appear first or last in a (sub)expression or after another `|', 559i.e. an operand of `|' cannot be an empty subexpression. 560An empty parenthesized subexpression, `()', is legal and matches an 561empty (sub)string. 562An empty string is not a legal RE. 563.Pp 564A `{' followed by a digit is considered the beginning of bounds for a 565bounded repetition, which must then follow the syntax for bounds. 566A `{' 567.Em not 568followed by a digit is considered an ordinary character. 569.Pp 570`^' and `$' beginning and ending subexpressions in obsolete (``basic'') 571REs are anchors, not ordinary characters. 572.Sh DIAGNOSTICS 573Non-zero error codes from 574.Fn regcomp 575and 576.Fn regexec 577include the following: 578.Pp 579.Bl -tag -width XXXREG_ECOLLATE -compact 580.It Dv REG_NOMATCH 581.Fn regexec 582failed to match 583.It Dv REG_BADPAT 584invalid regular expression 585.It Dv REG_ECOLLATE 586invalid collating element 587.It Dv REG_ECTYPE 588invalid character class 589.It Dv REG_EESCAPE 590\e applied to unescapable character 591.It Dv REG_ESUBREG 592invalid backreference number 593.It Dv REG_EBRACK 594brackets [ ] not balanced 595.It Dv REG_EPAREN 596parentheses ( ) not balanced 597.It Dv REG_EBRACE 598braces { } not balanced 599.It Dv REG_BADBR 600invalid repetition count(s) in { } 601.It Dv REG_ERANGE 602invalid character range in [ ] 603.It Dv REG_ESPACE 604ran out of memory 605.It Dv REG_BADRPT 606?, *, or + operand invalid 607.It Dv REG_EMPTY 608empty (sub)expression 609.It Dv REG_ASSERT 610``can't happen''\(emyou found a bug 611.It Dv REG_INVARG 612invalid argument, e.g. negative-length string 613.El 614.Sh SEE ALSO 615.Xr grep 1 , 616.Xr sed 1 , 617.Xr re_format 7 618.Pp 619.St -p1003.2-92 , 620sections 2.8 (Regular Expression Notation) 621and 622B.5 (C Binding for Regular Expression Matching). 623.Sh HISTORY 624Originally written by Henry Spencer. 625Altered for inclusion in the 626.Bx 4.4 627distribution. 628.Pp 629The 630.Fn regnsub 631and 632.Fn regasub 633functions appeared in 634.Nx 8 . 635.Sh BUGS 636There is one known functionality bug. 637The implementation of internationalization is incomplete: 638the locale is always assumed to be the default one of 639.St -p1003.2-92 , 640and only the collating elements etc. of that locale are available. 641.Pp 642The back-reference code is subtle and doubts linger about its correctness 643in complex cases. 644.Pp 645.Fn regexec 646performance is poor. 647This will improve with later releases. 648.Fa nmatch 649exceeding 0 is expensive; 650.Fa nmatch 651exceeding 1 is worse. 652.Fa regexec 653is largely insensitive to RE complexity 654.Em except 655that back references are massively expensive. 656RE length does matter; in particular, there is a strong speed bonus 657for keeping RE length under about 30 characters, 658with most special characters counting roughly double. 659.Pp 660.Fn regcomp 661implements bounded repetitions by macro expansion, 662which is costly in time and space if counts are large 663or bounded repetitions are nested. 664An RE like, say, 665`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' 666will (eventually) run almost any existing machine out of swap space. 667.Pp 668There are suspected problems with response to obscure error conditions. 669Notably, 670certain kinds of internal overflow, 671produced only by truly enormous REs or by multiply nested bounded repetitions, 672are probably not handled well. 673.Pp 674Due to a mistake in 675.St -p1003.2-92 , 676things like `a)b' are legal REs because `)' is a special character 677only in the presence of a previous unmatched `('. 678This can't be fixed until the spec is fixed. 679.Pp 680The standard's definition of back references is vague. 681For example, does 682`a\e(\e(b\e)*\e2\e)*d' match `abbbd'? 683Until the standard is clarified, behavior in such cases should not be 684relied on. 685.Pp 686The implementation of word-boundary matching is a bit of a kludge, 687and bugs may lurk in combinations of word-boundary matching and anchoring. 688