1.\" $NetBSD: regex.3,v 1.22 2011/05/17 03:35:38 enami Exp $ 2.\" 3.\" Copyright (c) 1992, 1993, 1994 4.\" The Regents of the University of California. All rights reserved. 5.\" 6.\" This code is derived from software contributed to Berkeley by 7.\" Henry Spencer. 8.\" 9.\" Redistribution and use in source and binary forms, with or without 10.\" modification, are permitted provided that the following conditions 11.\" are met: 12.\" 1. Redistributions of source code must retain the above copyright 13.\" notice, this list of conditions and the following disclaimer. 14.\" 2. Redistributions in binary form must reproduce the above copyright 15.\" notice, this list of conditions and the following disclaimer in the 16.\" documentation and/or other materials provided with the distribution. 17.\" 3. Neither the name of the University nor the names of its contributors 18.\" may be used to endorse or promote products derived from this software 19.\" without specific prior written permission. 20.\" 21.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 22.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 23.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 24.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 25.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 26.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 27.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 28.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 29.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 30.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 31.\" SUCH DAMAGE. 32.\" 33.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 34.\" 35.\" This code is derived from software contributed to Berkeley by 36.\" Henry Spencer. 37.\" 38.\" Redistribution and use in source and binary forms, with or without 39.\" modification, are permitted provided that the following conditions 40.\" are met: 41.\" 1. Redistributions of source code must retain the above copyright 42.\" notice, this list of conditions and the following disclaimer. 43.\" 2. Redistributions in binary form must reproduce the above copyright 44.\" notice, this list of conditions and the following disclaimer in the 45.\" documentation and/or other materials provided with the distribution. 46.\" 3. All advertising materials mentioning features or use of this software 47.\" must display the following acknowledgement: 48.\" This product includes software developed by the University of 49.\" California, Berkeley and its contributors. 50.\" 4. Neither the name of the University nor the names of its contributors 51.\" may be used to endorse or promote products derived from this software 52.\" without specific prior written permission. 53.\" 54.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 55.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 56.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 57.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 58.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 59.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 60.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 61.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 62.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 63.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 64.\" SUCH DAMAGE. 65.\" 66.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 67.\" 68.Dd December 29, 2003 69.Dt REGEX 3 70.Os 71.Sh NAME 72.Nm regex , 73.Nm regcomp , 74.Nm regexec , 75.Nm regerror , 76.Nm regfree 77.Nd regular-expression library 78.Sh LIBRARY 79.Lb libc 80.Sh SYNOPSIS 81.In regex.h 82.Ft int 83.Fn regcomp "regex_t * restrict preg" "const char * restrict pattern" "int cflags" 84.Ft int 85.Fn regexec "const regex_t * restrict preg" "const char * restrict string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 86.Ft size_t 87.Fn regerror "int errcode" "const regex_t * restrict preg" "char * restrict errbuf" "size_t errbuf_size" 88.Ft void 89.Fn regfree "regex_t *preg" 90.Sh DESCRIPTION 91These routines implement 92.St -p1003.2-92 93regular expressions (``RE''s); 94see 95.Xr re_format 7 . 96.Fn regcomp 97compiles an RE written as a string into an internal form, 98.Fn regexec 99matches that internal form against a string and reports results, 100.Fn regerror 101transforms error codes from either into human-readable messages, 102and 103.Fn regfree 104frees any dynamically-allocated storage used by the internal form 105of an RE. 106.Pp 107The header 108.In regex.h 109declares two structure types, 110.Fa regex_t 111and 112.Fa regmatch_t , 113the former for compiled internal forms and the latter for match reporting. 114It also declares the four functions, 115a type 116.Fa regoff_t , 117and a number of constants with names starting with ``REG_''. 118.Pp 119.Fn regcomp 120compiles the regular expression contained in the 121.Fa pattern 122string, 123subject to the flags in 124.Fa cflags , 125and places the results in the 126.Fa regex_t 127structure pointed to by 128.Fa preg . 129.Fa cflags 130is the bitwise OR of zero or more of the following flags: 131.Bl -tag -width XXXREG_EXTENDED 132.It Dv REG_EXTENDED 133Compile modern (``extended'') REs, rather than the obsolete 134(``basic'') REs that are the default. 135.It Dv REG_BASIC 136This is a synonym for 0, 137provided as a counterpart to REG_EXTENDED to improve readability. 138.It Dv REG_NOSPEC 139Compile with recognition of all special characters turned off. 140All characters are thus considered ordinary, so the ``RE'' is a literal 141string. 142This is an extension, compatible with but not specified by 143.St -p1003.2-92 , 144and should be used with caution in software intended to be portable to 145other systems. 146.Dv REG_EXTENDED 147and 148.Dv REG_NOSPEC 149may not be used in the same call to 150.Fn regcomp . 151.It Dv REG_ICASE 152Compile for matching that ignores upper/lower case distinctions. 153See 154.Xr re_format 7 . 155.It Dv REG_NOSUB 156Compile for matching that need only report success or failure, not 157what was matched. 158.It Dv REG_NEWLINE 159Compile for newline-sensitive matching. 160By default, newline is a completely ordinary character with no special 161meaning in either REs or strings. 162With this flag, 163`[^' bracket expressions and `.' never match newline, 164a `^' anchor matches the null string after any newline in the string 165in addition to its normal function, 166and the `$' anchor matches the null string before any newline in the 167string in addition to its normal function. 168.It Dv REG_PEND 169The regular expression ends, not at the first NUL, but just before the 170character pointed to by the 171.Fa re_endp 172member of the structure pointed to by 173.Fa preg . 174The 175.Fa re_endp 176member is of type 177.Fa "const\ char\ *" . 178This flag permits inclusion of NULs in the RE; they are considered 179ordinary characters. 180This is an extension, compatible with but not specified by 181.St -p1003.2-92 , 182and should be used with caution in software intended to be portable to 183other systems. 184.El 185.Pp 186When successful, 187.Fn regcomp 188returns 0 and fills in the structure pointed to by 189.Fa preg . 190One member of that structure (other than 191.Fa re_endp ) 192is publicized: 193.Fa re_nsub , 194of type 195.Fa size_t , 196contains the number of parenthesized subexpressions within the RE 197(except that the value of this member is undefined if the 198.Dv REG_NOSUB 199flag was used). 200If 201.Fn regcomp 202fails, it returns a non-zero error code; 203see 204.Sx DIAGNOSTICS . 205.Pp 206.Fn regexec 207matches the compiled RE pointed to by 208.Fa preg 209against the 210.Fa string , 211subject to the flags in 212.Fa eflags , 213and reports results using 214.Fa nmatch , 215.Fa pmatch , 216and the returned value. 217The RE must have been compiled by a previous invocation of 218.Fn regcomp . 219The compiled form is not altered during execution of 220.Fn regexec , 221so a single compiled RE can be used simultaneously by multiple threads. 222.Pp 223By default, 224the NUL-terminated string pointed to by 225.Fa string 226is considered to be the text of an entire line, minus any terminating 227newline. 228The 229.Fa eflags 230argument is the bitwise OR of zero or more of the following flags: 231.Bl -tag -width XXXREG_NOTBOL 232.It Dv REG_NOTBOL 233The first character of the string 234is not the beginning of a line, so the `^' anchor should not match before it. 235This does not affect the behavior of newlines under 236.Dv REG_NEWLINE . 237.It Dv REG_NOTEOL 238The NUL terminating the string does not end a line, so the `$' anchor 239should not match before it. 240This does not affect the behavior of newlines under 241.Dv REG_NEWLINE . 242.It Dv REG_STARTEND 243The string is considered to start at 244.Fa string 245+ 246.Fa pmatch[0].rm_so 247and to have a terminating NUL located at 248.Fa string 249+ 250.Fa pmatch[0].rm_eo 251(there need not actually be a NUL at that location), 252regardless of the value of 253.Fa nmatch . 254See below for the definition of 255.Fa pmatch 256and 257.Fa nmatch . 258This is an extension, compatible with but not specified by 259.St -p1003.2-92 , 260and should be used with caution in software intended to be portable to 261other systems. 262Note that a non-zero 263.Fa rm_so 264does not imply 265.Dv REG_NOTBOL ; 266.Dv REG_STARTEND 267affects only the location of the string, not how it is matched. 268.El 269.Pp 270See 271.Xr re_format 7 272for a discussion of what is matched in situations where an RE or a 273portion thereof could match any of several substrings of 274.Fa string . 275.Pp 276Normally, 277.Fn regexec 278returns 0 for success and the non-zero code 279.Dv REG_NOMATCH 280for failure. 281Other non-zero error codes may be returned in exceptional situations; 282see 283.Sx DIAGNOSTICS . 284.Pp 285If 286.Dv REG_NOSUB 287was specified in the compilation of the RE, or if 288.Fa nmatch 289is 0, 290.Fn regexec 291ignores the 292.Fa pmatch 293argument (but see below for the case where 294.Dv REG_STARTEND 295is specified). 296Otherwise, 297.Fa pmatch 298points to an array of 299.Fa nmatch 300structures of type 301.Fa regmatch_t . 302Such a structure has at least the members 303.Fa rm_so 304and 305.Fa rm_eo , 306both of type 307.Fa regoff_t 308(a signed arithmetic type at least as large as an 309.Fa off_t 310and a 311.Fa ssize_t ) , 312containing respectively the offset of the first character of a substring 313and the offset of the first character after the end of the substring. 314Offsets are measured from the beginning of the 315.Fa string 316argument given to 317.Fn regexec . 318An empty substring is denoted by equal offsets, 319both indicating the character following the empty substring. 320.Pp 321The 0th member of the 322.Fa pmatch 323array is filled in to indicate what substring of 324.Fa string 325was matched by the entire RE. 326Remaining members report what substring was matched by parenthesized 327subexpressions within the RE; 328member 329.Fa i 330reports subexpression 331.Fa i , 332with subexpressions counted (starting at 1) by the order of their 333opening parentheses in the RE, left to right. 334Unused entries in the array\(emcorresponding either to subexpressions that 335did not participate in the match at all, or to subexpressions that do not 336exist in the RE (that is, 337.Fa i 338\*[Gt] 339.Fa preg-\*[Gt]re_nsub ) 340\(emhave both 341.Fa rm_so 342and 343.Fa rm_eo 344set to -1. 345If a subexpression participated in the match several times, 346the reported substring is the last one it matched. 347(Note, as an example in particular, that when the RE `(b*)+' matches `bbb', 348the parenthesized subexpression matches each of the three `b's and then 349an infinite number of empty strings following the last `b', 350so the reported substring is one of the empties.) 351.Pp 352If 353.Dv REG_STARTEND 354is specified, 355.Fa pmatch 356must point to at least one 357.Fa regmatch_t 358(even if 359.Fa nmatch 360is 0 or 361.Dv REG_NOSUB 362was specified), 363to hold the input offsets for 364.Dv REG_STARTEND . 365Use for output is still entirely controlled by 366.Fa nmatch ; 367if 368.Fa nmatch 369is 0 or 370.Dv REG_NOSUB 371was specified, 372the value of 373.Fa pmatch [0] 374will not be changed by a successful 375.Fn regexec . 376.Pp 377.Fn regerror 378maps a non-zero 379.Fa errcode 380from either 381.Fn regcomp 382or 383.Fn regexec 384to a human-readable, printable message. 385If 386.Fa preg 387is non-NULL, 388the error code should have arisen from use of the 389.Fa regex_t 390pointed to by 391.Fa preg , 392and if the error code came from 393.Fn regcomp , 394it should have been the result from the most recent 395.Fn regcomp 396using that 397.Fa regex_t . 398.Po Fn regerror 399may be able to supply a more detailed message using information 400from the 401.Fa regex_t . Pc 402.Fn regerror 403places the NUL-terminated message into the buffer pointed to by 404.Fa errbuf , 405limiting the length (including the NUL) to at most 406.Fa errbuf_size 407bytes. 408If the whole message won't fit, 409as much of it as will fit before the terminating NUL is supplied. 410In any case, 411the returned value is the size of buffer needed to hold the whole 412message (including terminating NUL). 413If 414.Fa errbuf_size 415is 0, 416.Fa errbuf 417is ignored but the return value is still correct. 418.Pp 419If the 420.Fa errcode 421given to 422.Fn regerror 423is first ORed with 424.Dv REG_ITOA , 425the ``message'' that results is the printable name of the error code, 426e.g. ``REG_NOMATCH'', 427rather than an explanation thereof. 428If 429.Fa errcode 430is 431.Dv REG_ATOI , 432then 433.Fa preg 434shall be non-NULL and the 435.Fa re_endp 436member of the structure it points to 437must point to the printable name of an error code; 438in this case, the result in 439.Fa errbuf 440is the decimal digits of 441the numeric value of the error code 442(0 if the name is not recognized). 443.Dv REG_ITOA 444and 445.Dv REG_ATOI 446are intended primarily as debugging facilities; 447they are extensions, compatible with but not specified by 448.St -p1003.2-92 , 449and should be used with caution in software intended to be portable to 450other systems. 451Be warned also that they are considered experimental and changes are possible. 452.Pp 453.Fn regfree 454frees any dynamically-allocated storage associated with the compiled RE 455pointed to by 456.Fa preg . 457The remaining 458.Fa regex_t 459is no longer a valid compiled RE 460and the effect of supplying it to 461.Fn regexec 462or 463.Fn regerror 464is undefined. 465.Pp 466None of these functions references global variables except for tables 467of constants; 468all are safe for use from multiple threads if the arguments are safe. 469.Sh IMPLEMENTATION CHOICES 470There are a number of decisions that 471.St -p1003.2-92 472leaves up to the implementor, 473either by explicitly saying ``undefined'' or by virtue of them being 474forbidden by the RE grammar. 475This implementation treats them as follows. 476.Pp 477See 478.Xr re_format 7 479for a discussion of the definition of case-independent matching. 480.Pp 481There is no particular limit on the length of REs, 482except insofar as memory is limited. 483Memory usage is approximately linear in RE size, and largely insensitive 484to RE complexity, except for bounded repetitions. 485See BUGS for one short RE using them 486that will run almost any system out of memory. 487.Pp 488A backslashed character other than one specifically given a magic meaning 489by 490.St -p1003.2-92 491(such magic meanings occur only in obsolete [``basic''] REs) 492is taken as an ordinary character. 493.Pp 494Any unmatched [ is a 495.Dv REG_EBRACK 496error. 497.Pp 498Equivalence classes cannot begin or end bracket-expression ranges. 499The endpoint of one range cannot begin another. 500.Pp 501.Dv RE_DUP_MAX , 502the limit on repetition counts in bounded repetitions, is 255. 503.Pp 504A repetition operator (?, *, +, or bounds) cannot follow another 505repetition operator. 506A repetition operator cannot begin an expression or subexpression 507or follow `^' or `|'. 508.Pp 509`|' cannot appear first or last in a (sub)expression or after another `|', 510i.e. an operand of `|' cannot be an empty subexpression. 511An empty parenthesized subexpression, `()', is legal and matches an 512empty (sub)string. 513An empty string is not a legal RE. 514.Pp 515A `{' followed by a digit is considered the beginning of bounds for a 516bounded repetition, which must then follow the syntax for bounds. 517A `{' 518.Em not 519followed by a digit is considered an ordinary character. 520.Pp 521`^' and `$' beginning and ending subexpressions in obsolete (``basic'') 522REs are anchors, not ordinary characters. 523.Sh DIAGNOSTICS 524Non-zero error codes from 525.Fn regcomp 526and 527.Fn regexec 528include the following: 529.Pp 530.Bl -tag -width XXXREG_ECOLLATE -compact 531.It Dv REG_NOMATCH 532.Fn regexec 533failed to match 534.It Dv REG_BADPAT 535invalid regular expression 536.It Dv REG_ECOLLATE 537invalid collating element 538.It Dv REG_ECTYPE 539invalid character class 540.It Dv REG_EESCAPE 541\e applied to unescapable character 542.It Dv REG_ESUBREG 543invalid backreference number 544.It Dv REG_EBRACK 545brackets [ ] not balanced 546.It Dv REG_EPAREN 547parentheses ( ) not balanced 548.It Dv REG_EBRACE 549braces { } not balanced 550.It Dv REG_BADBR 551invalid repetition count(s) in { } 552.It Dv REG_ERANGE 553invalid character range in [ ] 554.It Dv REG_ESPACE 555ran out of memory 556.It Dv REG_BADRPT 557?, *, or + operand invalid 558.It Dv REG_EMPTY 559empty (sub)expression 560.It Dv REG_ASSERT 561``can't happen''\(emyou found a bug 562.It Dv REG_INVARG 563invalid argument, e.g. negative-length string 564.El 565.Sh SEE ALSO 566.Xr grep 1 , 567.Xr sed 1 , 568.Xr re_format 7 569.Pp 570.St -p1003.2-92 , 571sections 2.8 (Regular Expression Notation) 572and 573B.5 (C Binding for Regular Expression Matching). 574.Sh HISTORY 575Originally written by Henry Spencer. 576Altered for inclusion in the 577.Bx 4.4 578distribution. 579.Sh BUGS 580There is one known functionality bug. 581The implementation of internationalization is incomplete: 582the locale is always assumed to be the default one of 583.St -p1003.2-92 , 584and only the collating elements etc. of that locale are available. 585.Pp 586The back-reference code is subtle and doubts linger about its correctness 587in complex cases. 588.Pp 589.Fn regexec 590performance is poor. 591This will improve with later releases. 592.Fa nmatch 593exceeding 0 is expensive; 594.Fa nmatch 595exceeding 1 is worse. 596.Fa regexec 597is largely insensitive to RE complexity 598.Em except 599that back references are massively expensive. 600RE length does matter; in particular, there is a strong speed bonus 601for keeping RE length under about 30 characters, 602with most special characters counting roughly double. 603.Pp 604.Fn regcomp 605implements bounded repetitions by macro expansion, 606which is costly in time and space if counts are large 607or bounded repetitions are nested. 608An RE like, say, 609`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' 610will (eventually) run almost any existing machine out of swap space. 611.Pp 612There are suspected problems with response to obscure error conditions. 613Notably, 614certain kinds of internal overflow, 615produced only by truly enormous REs or by multiply nested bounded repetitions, 616are probably not handled well. 617.Pp 618Due to a mistake in 619.St -p1003.2-92 , 620things like `a)b' are legal REs because `)' is a special character 621only in the presence of a previous unmatched `('. 622This can't be fixed until the spec is fixed. 623.Pp 624The standard's definition of back references is vague. 625For example, does 626`a\e(\e(b\e)*\e2\e)*d' match `abbbd'? 627Until the standard is clarified, behavior in such cases should not be 628relied on. 629.Pp 630The implementation of word-boundary matching is a bit of a kludge, 631and bugs may lurk in combinations of word-boundary matching and anchoring. 632