1.\" $NetBSD: regex.3,v 1.13 2002/04/29 01:41:44 simonb Exp $ 2.\" 3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 4.\" Copyright (c) 1992, 1993, 1994 5.\" The Regents of the University of California. All rights reserved. 6.\" 7.\" This code is derived from software contributed to Berkeley by 8.\" Henry Spencer. 9.\" 10.\" Redistribution and use in source and binary forms, with or without 11.\" modification, are permitted provided that the following conditions 12.\" are met: 13.\" 1. Redistributions of source code must retain the above copyright 14.\" notice, this list of conditions and the following disclaimer. 15.\" 2. Redistributions in binary form must reproduce the above copyright 16.\" notice, this list of conditions and the following disclaimer in the 17.\" documentation and/or other materials provided with the distribution. 18.\" 3. All advertising materials mentioning features or use of this software 19.\" must display the following acknowledgement: 20.\" This product includes software developed by the University of 21.\" California, Berkeley and its contributors. 22.\" 4. Neither the name of the University nor the names of its contributors 23.\" may be used to endorse or promote products derived from this software 24.\" without specific prior written permission. 25.\" 26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 29.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 36.\" SUCH DAMAGE. 37.\" 38.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 39.\" 40.Dd March 20, 1994 41.Dt REGEX 3 42.Os 43.Sh NAME 44.Nm regex , 45.Nm regcomp , 46.Nm regexec , 47.Nm regerror , 48.Nm regfree 49.Nd regular-expression library 50.Sh LIBRARY 51.Lb libc 52.Sh SYNOPSIS 53.Fd #include \*[Lt]regex.h\*[Gt] 54.Ft int 55.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 56.Ft int 57.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 58.Ft size_t 59.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size" 60.Ft void 61.Fn regfree "regex_t *preg" 62.Sh DESCRIPTION 63These routines implement 64.St -p1003.2-92 65regular expressions (``RE''s); 66see 67.Xr re_format 7 . 68.Fn regcomp 69compiles an RE written as a string into an internal form, 70.Fn regexec 71matches that internal form against a string and reports results, 72.Fn regerror 73transforms error codes from either into human-readable messages, 74and 75.Fn regfree 76frees any dynamically-allocated storage used by the internal form 77of an RE. 78.Pp 79The header 80.Em \*[Lt]regex.h\*[Gt] 81declares two structure types, 82.Fa regex_t 83and 84.Fa regmatch_t , 85the former for compiled internal forms and the latter for match reporting. 86It also declares the four functions, 87a type 88.Fa regoff_t , 89and a number of constants with names starting with ``REG_''. 90.Pp 91.Fn regcomp 92compiles the regular expression contained in the 93.Fa pattern 94string, 95subject to the flags in 96.Fa cflags , 97and places the results in the 98.Fa regex_t 99structure pointed to by 100.Fa preg . 101.Fa cflags 102is the bitwise OR of zero or more of the following flags: 103.Bl -tag -width XXXREG_EXTENDED 104.It Dv REG_EXTENDED 105Compile modern (``extended'') REs, rather than the obsolete 106(``basic'') REs that are the default. 107.It Dv REG_BASIC 108This is a synonym for 0, 109provided as a counterpart to REG_EXTENDED to improve readability. 110.It Dv REG_NOSPEC 111Compile with recognition of all special characters turned off. All 112characters are thus considered ordinary, so the ``RE'' is a literal 113string. 114This is an extension, compatible with but not specified by 115.St -p1003.2-92 , 116and should be used with caution in software intended to be portable to 117other systems. 118.Dv REG_EXTENDED 119and 120.Dv REG_NOSPEC 121may not be used in the same call to 122.Fn regcomp . 123.It Dv REG_ICASE 124Compile for matching that ignores upper/lower case distinctions. See 125.Xr re_format 7 . 126.It Dv REG_NOSUB 127Compile for matching that need only report success or failure, not 128what was matched. 129.It Dv REG_NEWLINE 130Compile for newline-sensitive matching. 131By default, newline is a completely ordinary character with no special 132meaning in either REs or strings. 133With this flag, 134`[^' bracket expressions and `.' never match newline, 135a `^' anchor matches the null string after any newline in the string 136in addition to its normal function, 137and the `$' anchor matches the null string before any newline in the 138string in addition to its normal function. 139.It Dv REG_PEND 140The regular expression ends, not at the first NUL, but just before the 141character pointed to by the 142.Fa re_endp 143member of the structure pointed to by 144.Fa preg . 145The 146.Fa re_endp 147member is of type 148.Fa "const\ char\ *" . 149This flag permits inclusion of NULs in the RE; they are considered 150ordinary characters. 151This is an extension, compatible with but not specified by 152.St -p1003.2-92 , 153and should be used with caution in software intended to be portable to 154other systems. 155.El 156.Pp 157When successful, 158.Fn regcomp 159returns 0 and fills in the structure pointed to by 160.Fa preg . 161One member of that structure (other than 162.Fa re_endp ) 163is publicized: 164.Fa re_nsub , 165of type 166.Fa size_t , 167contains the number of parenthesized subexpressions within the RE 168(except that the value of this member is undefined if the 169.Dv REG_NOSUB 170flag was used). 171If 172.Fn regcomp 173fails, it returns a non-zero error code; 174see 175.Sx DIAGNOSTICS . 176.Pp 177.Fn regexec 178matches the compiled RE pointed to by 179.Fa preg 180against the 181.Fa string , 182subject to the flags in 183.Fa eflags , 184and reports results using 185.Fa nmatch , 186.Fa pmatch , 187and the returned value. 188The RE must have been compiled by a previous invocation of 189.Fn regcomp . 190The compiled form is not altered during execution of 191.Fn regexec , 192so a single compiled RE can be used simultaneously by multiple threads. 193.Pp 194By default, 195the NUL-terminated string pointed to by 196.Fa string 197is considered to be the text of an entire line, minus any terminating 198newline. 199The 200.Fa eflags 201argument is the bitwise OR of zero or more of the following flags: 202.Bl -tag -width XXXREG_NOTBOL 203.It Dv REG_NOTBOL 204The first character of the string 205is not the beginning of a line, so the `^' anchor should not match before it. 206This does not affect the behavior of newlines under 207.Dv REG_NEWLINE . 208.It Dv REG_NOTEOL 209The NUL terminating the string does not end a line, so the `$' anchor 210should not match before it. This does not affect the behavior of 211newlines under 212.Dv REG_NEWLINE . 213.It Dv REG_STARTEND 214The string is considered to start at 215.Fa string 216+ 217.Fa pmatch[0].rm_so 218and to have a terminating NUL located at 219.Fa string 220+ 221.Fa pmatch[0].rm_eo 222(there need not actually be a NUL at that location), 223regardless of the value of 224.Fa nmatch . 225See below for the definition of 226.Fa pmatch 227and 228.Fa nmatch . 229This is an extension, compatible with but not specified by 230.St -p1003.2-92 , 231and should be used with caution in software intended to be portable to 232other systems. 233Note that a non-zero 234.Fa rm_so 235does not imply 236.Dv REG_NOTBOL ; 237.Dv REG_STARTEND 238affects only the location of the string, not how it is matched. 239.El 240.Pp 241See 242.Xr re_format 7 243for a discussion of what is matched in situations where an RE or a 244portion thereof could match any of several substrings of 245.Fa string . 246.Pp 247Normally, 248.Fn regexec 249returns 0 for success and the non-zero code 250.Dv REG_NOMATCH 251for failure. 252Other non-zero error codes may be returned in exceptional situations; 253see 254.Sx DIAGNOSTICS . 255.Pp 256If 257.Dv REG_NOSUB 258was specified in the compilation of the RE, or if 259.Fa nmatch 260is 0, 261.Fn regexec 262ignores the 263.Fa pmatch 264argument (but see below for the case where 265.Dv REG_STARTEND 266is specified). 267Otherwise, 268.Fa pmatch 269points to an array of 270.Fa nmatch 271structures of type 272.Fa regmatch_t . 273Such a structure has at least the members 274.Fa rm_so 275and 276.Fa rm_eo , 277both of type 278.Fa regoff_t 279(a signed arithmetic type at least as large as an 280.Fa off_t 281and a 282.Fa ssize_t ) , 283containing respectively the offset of the first character of a substring 284and the offset of the first character after the end of the substring. 285Offsets are measured from the beginning of the 286.Fa string 287argument given to 288.Fn regexec . 289An empty substring is denoted by equal offsets, 290both indicating the character following the empty substring. 291.Pp 292The 0th member of the 293.Fa pmatch 294array is filled in to indicate what substring of 295.Fa string 296was matched by the entire RE. 297Remaining members report what substring was matched by parenthesized 298subexpressions within the RE; 299member 300.Fa i 301reports subexpression 302.Fa i , 303with subexpressions counted (starting at 1) by the order of their 304opening parentheses in the RE, left to right. 305Unused entries in the array\(emcorresponding either to subexpressions that 306did not participate in the match at all, or to subexpressions that do not 307exist in the RE (that is, 308.Fa i 309\*[Gt] 310.Fa preg-\*[Gt]re_nsub ) 311\(emhave both 312.Fa rm_so 313and 314.Fa rm_eo 315set to -1. 316If a subexpression participated in the match several times, 317the reported substring is the last one it matched. 318(Note, as an example in particular, that when the RE `(b*)+' matches `bbb', 319the parenthesized subexpression matches each of the three `b's and then 320an infinite number of empty strings following the last `b', 321so the reported substring is one of the empties.) 322.Pp 323If 324.Dv REG_STARTEND 325is specified, 326.Fa pmatch 327must point to at least one 328.Fa regmatch_t 329(even if 330.Fa nmatch 331is 0 or 332.Dv REG_NOSUB 333was specified), 334to hold the input offsets for 335.Dv REG_STARTEND . 336Use for output is still entirely controlled by 337.Fa nmatch ; 338if 339.Fa nmatch 340is 0 or 341.Dv REG_NOSUB 342was specified, 343the value of 344.Fa pmatch [0] 345will not be changed by a successful 346.Fn regexec . 347.Pp 348.Fn regerror 349maps a non-zero 350.Fa errcode 351from either 352.Fn regcomp 353or 354.Fn regexec 355to a human-readable, printable message. 356If 357.Fa preg 358is non-NULL, 359the error code should have arisen from use of the 360.Fa regex_t 361pointed to by 362.Fa preg , 363and if the error code came from 364.Fn regcomp , 365it should have been the result from the most recent 366.Fn regcomp 367using that 368.Fa regex_t . ( 369.Fn regerror 370may be able to supply a more detailed message using information 371from the 372.Fa regex_t . ) 373.Fn regerror 374places the NUL-terminated message into the buffer pointed to by 375.Fa errbuf , 376limiting the length (including the NUL) to at most 377.Fa errbuf_size 378bytes. 379If the whole message won't fit, 380as much of it as will fit before the terminating NUL is supplied. 381In any case, 382the returned value is the size of buffer needed to hold the whole 383message (including terminating NUL). 384If 385.Fa errbuf_size 386is 0, 387.Fa errbuf 388is ignored but the return value is still correct. 389.Pp 390If the 391.Fa errcode 392given to 393.Fn regerror 394is first ORed with 395.Dv REG_ITOA , 396the ``message'' that results is the printable name of the error code, 397e.g. ``REG_NOMATCH'', 398rather than an explanation thereof. 399If 400.Fa errcode 401is 402.Dv REG_ATOI , 403then 404.Fa preg 405shall be non-NULL and the 406.Fa re_endp 407member of the structure it points to 408must point to the printable name of an error code; 409in this case, the result in 410.Fa errbuf 411is the decimal digits of 412the numeric value of the error code 413(0 if the name is not recognized). 414.Dv REG_ITOA 415and 416.Dv REG_ATOI 417are intended primarily as debugging facilities; 418they are extensions, compatible with but not specified by 419.St -p1003.2-92 , 420and should be used with caution in software intended to be portable to 421other systems. 422Be warned also that they are considered experimental and changes are possible. 423.Pp 424.Fn regfree 425frees any dynamically-allocated storage associated with the compiled RE 426pointed to by 427.Fa preg . 428The remaining 429.Fa regex_t 430is no longer a valid compiled RE 431and the effect of supplying it to 432.Fn regexec 433or 434.Fn regerror 435is undefined. 436.Pp 437None of these functions references global variables except for tables 438of constants; 439all are safe for use from multiple threads if the arguments are safe. 440.Sh IMPLEMENTATION CHOICES 441There are a number of decisions that 442.St -p1003.2-92 443leaves up to the implementor, 444either by explicitly saying ``undefined'' or by virtue of them being 445forbidden by the RE grammar. 446This implementation treats them as follows. 447.Pp 448See 449.Xr re_format 7 450for a discussion of the definition of case-independent matching. 451.Pp 452There is no particular limit on the length of REs, 453except insofar as memory is limited. 454Memory usage is approximately linear in RE size, and largely insensitive 455to RE complexity, except for bounded repetitions. 456See BUGS for one short RE using them 457that will run almost any system out of memory. 458.Pp 459A backslashed character other than one specifically given a magic meaning 460by 461.St -p1003.2-92 462(such magic meanings occur only in obsolete [``basic''] REs) 463is taken as an ordinary character. 464.Pp 465Any unmatched [ is a 466.Dv REG_EBRACK 467error. 468.Pp 469Equivalence classes cannot begin or end bracket-expression ranges. 470The endpoint of one range cannot begin another. 471.Pp 472.Dv RE_DUP_MAX , 473the limit on repetition counts in bounded repetitions, is 255. 474.Pp 475A repetition operator (?, *, +, or bounds) cannot follow another 476repetition operator. 477A repetition operator cannot begin an expression or subexpression 478or follow `^' or `|'. 479.Pp 480`|' cannot appear first or last in a (sub)expression or after another `|', 481i.e. an operand of `|' cannot be an empty subexpression. 482An empty parenthesized subexpression, `()', is legal and matches an 483empty (sub)string. 484An empty string is not a legal RE. 485.Pp 486A `{' followed by a digit is considered the beginning of bounds for a 487bounded repetition, which must then follow the syntax for bounds. 488A `{' \fInot\fR followed by a digit is considered an ordinary character. 489.Pp 490`^' and `$' beginning and ending subexpressions in obsolete (``basic'') 491REs are anchors, not ordinary characters. 492.Sh DIAGNOSTICS 493Non-zero error codes from 494.Fn regcomp 495and 496.Fn regexec 497include the following: 498.Pp 499.Bl -tag -width XXXREG_ECOLLATE -compact 500.It Dv REG_NOMATCH 501regexec() failed to match 502.It Dv REG_BADPAT 503invalid regular expression 504.It Dv REG_ECOLLATE 505invalid collating element 506.It Dv REG_ECTYPE 507invalid character class 508.It Dv REG_EESCAPE 509\e applied to unescapable character 510.It Dv REG_ESUBREG 511invalid backreference number 512.It Dv REG_EBRACK 513brackets [ ] not balanced 514.It Dv REG_EPAREN 515parentheses ( ) not balanced 516.It Dv REG_EBRACE 517braces { } not balanced 518.It Dv REG_BADBR 519invalid repetition count(s) in { } 520.It Dv REG_ERANGE 521invalid character range in [ ] 522.It Dv REG_ESPACE 523ran out of memory 524.It Dv REG_BADRPT 525?, *, or + operand invalid 526.It Dv REG_EMPTY 527empty (sub)expression 528.It Dv REG_ASSERT 529``can't happen''\(emyou found a bug 530.It Dv REG_INVARG 531invalid argument, e.g. negative-length string 532.El 533.Sh SEE ALSO 534.Xr grep 1 , 535.Xr sed 1 , 536.Xr re_format 7 537.Pp 538.St -p1003.2-92 , 539sections 2.8 (Regular Expression Notation) 540and 541B.5 (C Binding for Regular Expression Matching). 542.Sh HISTORY 543Originally written by Henry Spencer. 544Altered for inclusion in the 545.Bx 4.4 546distribution. 547.Sh BUGS 548This is an alpha release with known defects. 549Please report problems. 550.Pp 551There is one known functionality bug. 552The implementation of internationalization is incomplete: 553the locale is always assumed to be the default one of 554.St -p1003.2-92 , 555and only the collating elements etc. of that locale are available. 556.Pp 557The back-reference code is subtle and doubts linger about its correctness 558in complex cases. 559.Pp 560.Fn regexec 561performance is poor. 562This will improve with later releases. 563.Fa nmatch 564exceeding 0 is expensive; 565.Fa nmatch 566exceeding 1 is worse. 567.Fa regexec 568is largely insensitive to RE complexity 569.Em except 570that back references are massively expensive. 571RE length does matter; in particular, there is a strong speed bonus 572for keeping RE length under about 30 characters, 573with most special characters counting roughly double. 574.Pp 575.Fn regcomp 576implements bounded repetitions by macro expansion, 577which is costly in time and space if counts are large 578or bounded repetitions are nested. 579An RE like, say, 580`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' 581will (eventually) run almost any existing machine out of swap space. 582.Pp 583There are suspected problems with response to obscure error conditions. 584Notably, 585certain kinds of internal overflow, 586produced only by truly enormous REs or by multiply nested bounded repetitions, 587are probably not handled well. 588.Pp 589Due to a mistake in 590.St -p1003.2-92 , 591things like `a)b' are legal REs because `)' is a special character 592only in the presence of a previous unmatched `('. This can't be fixed 593until the spec is fixed. 594.Pp 595The standard's definition of back references is vague. 596For example, does 597`a\e(\e(b\e)*\e2\e)*d' match `abbbd'? 598Until the standard is clarified, behavior in such cases should not be 599relied on. 600.Pp 601The implementation of word-boundary matching is a bit of a kludge, 602and bugs may lurk in combinations of word-boundary matching and anchoring. 603