1.\" $NetBSD: regex.3,v 1.9 1998/08/29 08:32:34 lukem Exp $ 2.\" 3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 4.\" Copyright (c) 1992, 1993, 1994 5.\" The Regents of the University of California. All rights reserved. 6.\" 7.\" This code is derived from software contributed to Berkeley by 8.\" Henry Spencer. 9.\" 10.\" Redistribution and use in source and binary forms, with or without 11.\" modification, are permitted provided that the following conditions 12.\" are met: 13.\" 1. Redistributions of source code must retain the above copyright 14.\" notice, this list of conditions and the following disclaimer. 15.\" 2. Redistributions in binary form must reproduce the above copyright 16.\" notice, this list of conditions and the following disclaimer in the 17.\" documentation and/or other materials provided with the distribution. 18.\" 3. All advertising materials mentioning features or use of this software 19.\" must display the following acknowledgement: 20.\" This product includes software developed by the University of 21.\" California, Berkeley and its contributors. 22.\" 4. Neither the name of the University nor the names of its contributors 23.\" may be used to endorse or promote products derived from this software 24.\" without specific prior written permission. 25.\" 26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 29.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 36.\" SUCH DAMAGE. 37.\" 38.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 39.\" 40.Dd March 20, 1994 41.Dt REGEX 3 42.Os 43.Sh NAME 44.Nm regex , 45.Nm regcomp , 46.Nm regexec , 47.Nm regerror , 48.Nm regfree 49.Nd regular-expression library 50.Sh LIBRARY 51.Lb libc 52.Sh SYNOPSIS 53.Fd #include <sys/types.h> 54.Fd #include <regex.h> 55.Ft int 56.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 57.Ft int 58.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 59.Ft size_t 60.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size" 61.Ft void 62.Fn regfree "regex_t *preg" 63.Sh DESCRIPTION 64These routines implement 65.St -p1003.2-92 66regular expressions (``RE''s); 67see 68.Xr re_format 7 . 69.Fn regcomp 70compiles an RE written as a string into an internal form, 71.Fn regexec 72matches that internal form against a string and reports results, 73.Fn regerror 74transforms error codes from either into human-readable messages, 75and 76.Fn regfree 77frees any dynamically-allocated storage used by the internal form 78of an RE. 79.Pp 80The header 81.Em <regex.h> 82declares two structure types, 83.Fa regex_t 84and 85.Fa regmatch_t , 86the former for compiled internal forms and the latter for match reporting. 87It also declares the four functions, 88a type 89.Fa regoff_t , 90and a number of constants with names starting with ``REG_''. 91.Pp 92.Fn regcomp 93compiles the regular expression contained in the 94.Fa pattern 95string, 96subject to the flags in 97.Fa cflags , 98and places the results in the 99.Fa regex_t 100structure pointed to by 101.Fa preg . 102.Fa cflags 103is the bitwise OR of zero or more of the following flags: 104.Bl -tag -width XXXREG_EXTENDED 105.It Dv REG_EXTENDED 106Compile modern (``extended'') REs, rather than the obsolete 107(``basic'') REs that are the default. 108.It Dv REG_BASIC 109This is a synonym for 0, 110provided as a counterpart to REG_EXTENDED to improve readability. 111.It Dv REG_NOSPEC 112Compile with recognition of all special characters turned off. All 113characters are thus considered ordinary, so the ``RE'' is a literal 114string. 115This is an extension, compatible with but not specified by 116.St -p1003.2-92 , 117and should be used with caution in software intended to be portable to 118other systems. 119.Dv REG_EXTENDED 120and 121.Dv REG_NOSPEC 122may not be used in the same call to 123.Fn regcomp . 124.It Dv REG_ICASE 125Compile for matching that ignores upper/lower case distinctions. See 126.Xr re_format 7 . 127.It Dv REG_NOSUB 128Compile for matching that need only report success or failure, not 129what was matched. 130.It Dv REG_NEWLINE 131Compile for newline-sensitive matching. 132By default, newline is a completely ordinary character with no special 133meaning in either REs or strings. 134With this flag, 135`[^' bracket expressions and `.' never match newline, 136a `^' anchor matches the null string after any newline in the string 137in addition to its normal function, 138and the `$' anchor matches the null string before any newline in the 139string in addition to its normal function. 140.It Dv REG_PEND 141The regular expression ends, not at the first NUL, but just before the 142character pointed to by the 143.Fa re_endp 144member of the structure pointed to by 145.Fa preg . 146The 147.Fa re_endp 148member is of type 149.Fa "const\ char\ *" . 150This flag permits inclusion of NULs in the RE; they are considered 151ordinary characters. 152This is an extension, compatible with but not specified by 153.St -p1003.2-92 , 154and should be used with caution in software intended to be portable to 155other systems. 156.El 157.Pp 158When successful, 159.Fn regcomp 160returns 0 and fills in the structure pointed to by 161.Fa preg . 162One member of that structure (other than 163.Fa re_endp ) 164is publicized: 165.Fa re_nsub , 166of type 167.Fa size_t , 168contains the number of parenthesized subexpressions within the RE 169(except that the value of this member is undefined if the 170.Dv REG_NOSUB 171flag was used). 172If 173.Fn regcomp 174fails, it returns a non-zero error code; 175see DIAGNOSTICS. 176.Pp 177.Fn regexec 178matches the compiled RE pointed to by 179.Fa preg 180against the 181.Fa string , 182subject to the flags in 183.Fa eflags , 184and reports results using 185.Fa nmatch , 186.Fa pmatch , 187and the returned value. 188The RE must have been compiled by a previous invocation of 189.Fn regcomp . 190The compiled form is not altered during execution of 191.Fn regexec , 192so a single compiled RE can be used simultaneously by multiple threads. 193.Pp 194By default, 195the NUL-terminated string pointed to by 196.Fa string 197is considered to be the text of an entire line, minus any terminating 198newline. 199The 200.Fa eflags 201argument is the bitwise OR of zero or more of the following flags: 202.Bl -tag -width XXXREG_NOTBOL 203.It Dv REG_NOTBOL 204The first character of the string 205is not the beginning of a line, so the `^' anchor should not match before it. 206This does not affect the behavior of newlines under 207.Dv REG_NEWLINE . 208.It Dv REG_NOTEOL 209The NUL terminating the string does not end a line, so the `$' anchor 210should not match before it. This does not affect the behavior of 211newlines under 212.Dv REG_NEWLINE . 213.It Dv REG_STARTEND 214The string is considered to start at 215.Fa string 216+ 217.Fa pmatch[0].rm_so 218and to have a terminating NUL located at 219.Fa string 220+ 221.Fa pmatch[0].rm_eo 222(there need not actually be a NUL at that location), 223regardless of the value of 224.Fa nmatch . 225See below for the definition of 226.Fa pmatch 227and 228.Fa nmatch . 229This is an extension, compatible with but not specified by 230.St -p1003.2-92 , 231and should be used with caution in software intended to be portable to 232other systems. 233Note that a non-zero 234.Fa rm_so 235does not imply 236.Dv REG_NOTBOL ; 237.Dv REG_STARTEND 238affects only the location of the string, not how it is matched. 239.El 240.Pp 241See 242.Xr re_format 7 243for a discussion of what is matched in situations where an RE or a 244portion thereof could match any of several substrings of 245.Fa string . 246.Pp 247Normally, 248.Fn regexec 249returns 0 for success and the non-zero code 250.Dv REG_NOMATCH 251for failure. 252Other non-zero error codes may be returned in exceptional situations; 253see DIAGNOSTICS. 254.Pp 255If 256.Dv REG_NOSUB 257was specified in the compilation of the RE, or if 258.Fa nmatch 259is 0, 260.Fn regexec 261ignores the 262.Fa pmatch 263argument (but see below for the case where 264.Dv REG_STARTEND 265is specified). 266Otherwise, 267.Fa pmatch 268points to an array of 269.Fa nmatch 270structures of type 271.Fa regmatch_t . 272Such a structure has at least the members 273.Fa rm_so 274and 275.Fa rm_eo , 276both of type 277.Fa regoff_t 278(a signed arithmetic type at least as large as an 279.Fa off_t 280and a 281.Fa ssize_t ) , 282containing respectively the offset of the first character of a substring 283and the offset of the first character after the end of the substring. 284Offsets are measured from the beginning of the 285.Fa string 286argument given to 287.Fn regexec . 288An empty substring is denoted by equal offsets, 289both indicating the character following the empty substring. 290.Pp 291The 0th member of the 292.Fa pmatch 293array is filled in to indicate what substring of 294.Fa string 295was matched by the entire RE. 296Remaining members report what substring was matched by parenthesized 297subexpressions within the RE; 298member 299.Fa i 300reports subexpression 301.Fa i , 302with subexpressions counted (starting at 1) by the order of their 303opening parentheses in the RE, left to right. 304Unused entries in the array\(emcorresponding either to subexpressions that 305did not participate in the match at all, or to subexpressions that do not 306exist in the RE (that is, 307.Fa i 308> 309.Fa preg->re_nsub ) 310\(emhave both 311.Fa rm_so 312and 313.Fa rm_eo 314set to -1. 315If a subexpression participated in the match several times, 316the reported substring is the last one it matched. 317(Note, as an example in particular, that when the RE `(b*)+' matches `bbb', 318the parenthesized subexpression matches each of the three `b's and then 319an infinite number of empty strings following the last `b', 320so the reported substring is one of the empties.) 321.Pp 322If 323.Dv REG_STARTEND 324is specified, 325.Fa pmatch 326must point to at least one 327.Fa regmatch_t 328(even if 329.Fa nmatch 330is 0 or 331.Dv REG_NOSUB 332was specified), 333to hold the input offsets for 334.Dv REG_STARTEND . 335Use for output is still entirely controlled by 336.Fa nmatch ; 337if 338.Fa nmatch 339is 0 or 340.Dv REG_NOSUB 341was specified, 342the value of 343.Fa pmatch [0] 344will not be changed by a successful 345.Fn regexec . 346.Pp 347.Fn regerror 348maps a non-zero 349.Fa errcode 350from either 351.Fn regcomp 352or 353.Fn regexec 354to a human-readable, printable message. 355If 356.Fa preg 357is non-NULL, 358the error code should have arisen from use of the 359.Fa regex_t 360pointed to by 361.Fa preg , 362and if the error code came from 363.Fn regcomp , 364it should have been the result from the most recent 365.Fn regcomp 366using that 367.Fa regex_t . ( 368.Fn regerror 369may be able to supply a more detailed message using information 370from the 371.Fa regex_t . ) 372.Fn regerror 373places the NUL-terminated message into the buffer pointed to by 374.Fa errbuf , 375limiting the length (including the NUL) to at most 376.Fa errbuf_size 377bytes. 378If the whole message won't fit, 379as much of it as will fit before the terminating NUL is supplied. 380In any case, 381the returned value is the size of buffer needed to hold the whole 382message (including terminating NUL). 383If 384.Fa errbuf_size 385is 0, 386.Fa errbuf 387is ignored but the return value is still correct. 388.Pp 389If the 390.Fa errcode 391given to 392.Fn regerror 393is first ORed with 394.Dv REG_ITOA , 395the ``message'' that results is the printable name of the error code, 396e.g. ``REG_NOMATCH'', 397rather than an explanation thereof. 398If 399.Fa errcode 400.Dv REG_ATOI , 401then 402.Fa preg 403shall be non-NULL and the 404.Fa re_endp 405member of the structure it points to 406must point to the printable name of an error code; 407in this case, the result in 408.Fa errbuf 409is the decimal digits of 410the numeric value of the error code 411(0 if the name is not recognized). 412.Dv REG_ITOA 413and 414.Dv REG_ATOI 415are intended primarily as debugging facilities; 416they are extensions, compatible with but not specified by 417.St -p1003.2-92 , 418and should be used with caution in software intended to be portable to 419other systems. 420Be warned also that they are considered experimental and changes are possible. 421.Pp 422.Fn regfree 423frees any dynamically-allocated storage associated with the compiled RE 424pointed to by 425.Fa preg . 426The remaining 427.Fa regex_t 428is no longer a valid compiled RE 429and the effect of supplying it to 430.Fn regexec 431or 432.Fn regerror 433is undefined. 434.Pp 435None of these functions references global variables except for tables 436of constants; 437all are safe for use from multiple threads if the arguments are safe. 438.Sh IMPLEMENTATION CHOICES 439There are a number of decisions that 440.St -p1003.2-92 441leaves up to the implementor, 442either by explicitly saying ``undefined'' or by virtue of them being 443forbidden by the RE grammar. 444This implementation treats them as follows. 445.Pp 446See 447.Xr re_format 7 448for a discussion of the definition of case-independent matching. 449.Pp 450There is no particular limit on the length of REs, 451except insofar as memory is limited. 452Memory usage is approximately linear in RE size, and largely insensitive 453to RE complexity, except for bounded repetitions. 454See BUGS for one short RE using them 455that will run almost any system out of memory. 456.Pp 457A backslashed character other than one specifically given a magic meaning 458by 459.St -p1003.2-92 460(such magic meanings occur only in obsolete [``basic''] REs) 461is taken as an ordinary character. 462.Pp 463Any unmatched [ is a 464.Dv REG_EBRACK 465error. 466.Pp 467Equivalence classes cannot begin or end bracket-expression ranges. 468The endpoint of one range cannot begin another. 469.Pp 470.Dv RE_DUP_MAX , 471the limit on repetition counts in bounded repetitions, is 255. 472.Pp 473A repetition operator (?, *, +, or bounds) cannot follow another 474repetition operator. 475A repetition operator cannot begin an expression or subexpression 476or follow `^' or `|'. 477.Pp 478`|' cannot appear first or last in a (sub)expression or after another `|', 479i.e. an operand of `|' cannot be an empty subexpression. 480An empty parenthesized subexpression, `()', is legal and matches an 481empty (sub)string. 482An empty string is not a legal RE. 483.Pp 484A `{' followed by a digit is considered the beginning of bounds for a 485bounded repetition, which must then follow the syntax for bounds. 486A `{' \fInot\fR followed by a digit is considered an ordinary character. 487.Pp 488`^' and `$' beginning and ending subexpressions in obsolete (``basic'') 489REs are anchors, not ordinary characters. 490.Sh SEE ALSO 491.Xr grep 1 , 492.Xr sed 1 , 493.Xr re_format 7 494.Pp 495.St -p1003.2-92 , 496sections 2.8 (Regular Expression Notation) 497and 498B.5 (C Binding for Regular Expression Matching). 499.Sh DIAGNOSTICS 500Non-zero error codes from 501.Fn regcomp 502and 503.Fn regexec 504include the following: 505.Pp 506.Bl -tag -width XXXREG_ECOLLATE -compact 507.It Dv REG_NOMATCH 508regexec() failed to match 509.It Dv REG_BADPAT 510invalid regular expression 511.It Dv REG_ECOLLATE 512invalid collating element 513.It Dv REG_ECTYPE 514invalid character class 515.It Dv REG_EESCAPE 516\e applied to unescapable character 517.It Dv REG_ESUBREG 518invalid backreference number 519.It Dv REG_EBRACK 520brackets [ ] not balanced 521.It Dv REG_EPAREN 522parentheses ( ) not balanced 523.It Dv REG_EBRACE 524braces { } not balanced 525.It Dv REG_BADBR 526invalid repetition count(s) in { } 527.It Dv REG_ERANGE 528invalid character range in [ ] 529.It Dv REG_ESPACE 530ran out of memory 531.It Dv REG_BADRPT 532?, *, or + operand invalid 533.It Dv REG_EMPTY 534empty (sub)expression 535.It Dv REG_ASSERT 536``can't happen''\(emyou found a bug 537.It Dv REG_INVARG 538invalid argument, e.g. negative-length string 539.El 540.Sh HISTORY 541Originally written by Henry Spencer. 542Altered for inclusion in the 543.Bx 4.4 544distribution. 545.Sh BUGS 546This is an alpha release with known defects. 547Please report problems. 548.Pp 549There is one known functionality bug. 550The implementation of internationalization is incomplete: 551the locale is always assumed to be the default one of 552.St -p1003.2-92 , 553and only the collating elements etc. of that locale are available. 554.Pp 555The back-reference code is subtle and doubts linger about its correctness 556in complex cases. 557.Pp 558.Fn regexec 559performance is poor. 560This will improve with later releases. 561.Fa nmatch 562exceeding 0 is expensive; 563.Fa nmatch 564exceeding 1 is worse. 565.Fa regexec 566is largely insensitive to RE complexity 567.Em except 568that back references are massively expensive. 569RE length does matter; in particular, there is a strong speed bonus 570for keeping RE length under about 30 characters, 571with most special characters counting roughly double. 572.Pp 573.Fn regcomp 574implements bounded repetitions by macro expansion, 575which is costly in time and space if counts are large 576or bounded repetitions are nested. 577An RE like, say, 578`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' 579will (eventually) run almost any existing machine out of swap space. 580.Pp 581There are suspected problems with response to obscure error conditions. 582Notably, 583certain kinds of internal overflow, 584produced only by truly enormous REs or by multiply nested bounded repetitions, 585are probably not handled well. 586.Pp 587Due to a mistake in 588.St -p1003.2-92 , 589things like `a)b' are legal REs because `)' is a special character 590only in the presence of a previous unmatched `('. This can't be fixed 591until the spec is fixed. 592.Pp 593The standard's definition of back references is vague. 594For example, does 595`a\e(\e(b\e)*\e2\e)*d' match `abbbd'? 596Until the standard is clarified, behavior in such cases should not be 597relied on. 598.Pp 599The implementation of word-boundary matching is a bit of a kludge, 600and bugs may lurk in combinations of word-boundary matching and anchoring. 601