1.\" $OpenBSD: regex.3,v 1.21 2007/05/31 19:19:30 jmc Exp $ 2.\" 3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. 4.\" 5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 6.\" Copyright (c) 1992, 1993, 1994 7.\" The Regents of the University of California. All rights reserved. 8.\" 9.\" This code is derived from software contributed to Berkeley by 10.\" Henry Spencer. 11.\" 12.\" Redistribution and use in source and binary forms, with or without 13.\" modification, are permitted provided that the following conditions 14.\" are met: 15.\" 1. Redistributions of source code must retain the above copyright 16.\" notice, this list of conditions and the following disclaimer. 17.\" 2. Redistributions in binary form must reproduce the above copyright 18.\" notice, this list of conditions and the following disclaimer in the 19.\" documentation and/or other materials provided with the distribution. 20.\" 3. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 37.\" 38.Dd $Mdocdate: May 31 2007 $ 39.Dt REGEX 3 40.Os 41.Sh NAME 42.Nm regcomp , 43.Nm regexec , 44.Nm regerror , 45.Nm regfree 46.Nd regular expression routines 47.Sh SYNOPSIS 48.Fd #include <sys/types.h> 49.Fd #include <regex.h> 50.Ft int 51.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 52.Pp 53.Ft int 54.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \ 55 "regmatch_t pmatch[]" "int eflags" 56.Pp 57.Ft size_t 58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \ 59 "size_t errbuf_size" 60.Pp 61.Ft void 62.Fn regfree "regex_t *preg" 63.Sh DESCRIPTION 64These routines implement 65.St -p1003.2 66regular expressions 67.Pq Dq REs ; 68see 69.Xr re_format 7 . 70.Fn regcomp 71compiles an RE written as a string into an internal form, 72.Fn regexec 73matches that internal form against a string and reports results, 74.Fn regerror 75transforms error codes from either into human-readable messages, and 76.Fn regfree 77frees any dynamically allocated storage used by the internal form 78of an RE. 79.Pp 80The header 81.Aq Pa regex.h 82declares two structure types, 83.Li regex_t 84and 85.Li regmatch_t , 86the former for compiled internal forms and the latter for match reporting. 87It also declares the four functions, 88a type 89.Li regoff_t , 90and a number of constants with names starting with 91.Dv REG_ . 92.Pp 93.Fn regcomp 94compiles the regular expression contained in the 95.Fa pattern 96string, 97subject to the flags in 98.Fa cflags , 99and places the results in the 100.Li regex_t 101structure pointed to by 102.Fa preg . 103.Fa cflags 104is the bitwise 105.Tn OR 106of zero or more of the following flags: 107.Bl -tag -width XREG_EXTENDEDX 108.It Dv REG_EXTENDED 109Compile modern 110.Pq Dq extended 111REs, 112rather than the obsolete 113.Pq Dq basic 114REs that are the default. 115.It Dv REG_BASIC 116This is a synonym for 0, 117provided as a counterpart to 118.Dv REG_EXTENDED 119to improve readability. 120.It Dv REG_NOSPEC 121Compile with recognition of all special characters turned off. 122All characters are thus considered ordinary, 123so the RE is a literal string. 124This is an extension, 125compatible with but not specified by 126.St -p1003.2 , 127and should be used with 128caution in software intended to be portable to other systems. 129.Dv REG_EXTENDED 130and 131.Dv REG_NOSPEC 132may not be used in the same call to 133.Fn regcomp . 134.It Dv REG_ICASE 135Compile for matching that ignores upper/lower case distinctions. 136See 137.Xr re_format 7 . 138.It Dv REG_NOSUB 139Compile for matching that need only report success or failure, 140not what was matched. 141.It Dv REG_NEWLINE 142Compile for newline-sensitive matching. 143By default, newline is a completely ordinary character with no special 144meaning in either REs or strings. 145With this flag, 146.Ql \&[^ 147bracket expressions and 148.Ql \&. 149never match newline, 150a 151.Ql ^ 152anchor matches the null string after any newline in the string 153in addition to its normal function, 154and the 155.Ql $ 156anchor matches the null string before any newline in the 157string in addition to its normal function. 158.It Dv REG_PEND 159The regular expression ends, 160not at the first NUL, 161but just before the character pointed to by the 162.Fa re_endp 163member of the structure pointed to by 164.Fa preg . 165The 166.Fa re_endp 167member is of type 168.Fa const\ char\ * . 169This flag permits inclusion of NULs in the RE; 170they are considered ordinary characters. 171This is an extension, 172compatible with but not specified by 173.St -p1003.2 , 174and should be used with 175caution in software intended to be portable to other systems. 176.El 177.Pp 178When successful, 179.Fn regcomp 180returns 0 and fills in the structure pointed to by 181.Fa preg . 182One member of that structure 183(other than 184.Fa re_endp ) 185is publicized: 186.Fa re_nsub , 187of type 188.Fa size_t , 189contains the number of parenthesized subexpressions within the RE 190(except that the value of this member is undefined if the 191.Dv REG_NOSUB 192flag was used). 193If 194.Fn regcomp 195fails, it returns a non-zero error code; 196see DIAGNOSTICS. 197.Pp 198.Fn regexec 199matches the compiled RE pointed to by 200.Fa preg 201against the 202.Fa string , 203subject to the flags in 204.Fa eflags , 205and reports results using 206.Fa nmatch , 207.Fa pmatch , 208and the returned value. 209The RE must have been compiled by a previous invocation of 210.Fn regcomp . 211The compiled form is not altered during execution of 212.Fn regexec , 213so a single compiled RE can be used simultaneously by multiple threads. 214.Pp 215By default, 216the NUL-terminated string pointed to by 217.Fa string 218is considered to be the text of an entire line, minus any terminating 219newline. 220The 221.Fa eflags 222argument is the bitwise 223.Tn OR 224of zero or more of the following flags: 225.Bl -tag -width XREG_STARTENDX 226.It Dv REG_NOTBOL 227The first character of 228the string 229is not the beginning of a line, so the 230.Ql ^ 231anchor should not match before it. 232This does not affect the behavior of newlines under 233.Dv REG_NEWLINE . 234.It Dv REG_NOTEOL 235The NUL terminating 236the string 237does not end a line, so the 238.Ql $ 239anchor should not match before it. 240This does not affect the behavior of newlines under 241.Dv REG_NEWLINE . 242.It Dv REG_STARTEND 243The string is considered to start at 244\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR 245and to have a terminating NUL located at 246\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR 247(there need not actually be a NUL at that location), 248regardless of the value of 249.Fa nmatch . 250See below for the definition of 251.Fa pmatch 252and 253.Fa nmatch . 254This is an extension, 255compatible with but not specified by 256.St -p1003.2 , 257and should be used with 258caution in software intended to be portable to other systems. 259Note that a non-zero \fIrm_so\fR does not imply 260.Dv REG_NOTBOL ; 261.Dv REG_STARTEND 262affects only the location of the string, 263not how it is matched. 264.El 265.Pp 266See 267.Xr re_format 7 268for a discussion of what is matched in situations where an RE or a 269portion thereof could match any of several substrings of 270.Fa string . 271.Pp 272Normally, 273.Fn regexec 274returns 0 for success and the non-zero code 275.Dv REG_NOMATCH 276for failure. 277Other non-zero error codes may be returned in exceptional situations; 278see DIAGNOSTICS. 279.Pp 280If 281.Dv REG_NOSUB 282was specified in the compilation of the RE, 283or if 284.Fa nmatch 285is 0, 286.Fn regexec 287ignores the 288.Fa pmatch 289argument (but see below for the case where 290.Dv REG_STARTEND 291is specified). 292Otherwise, 293.Fa pmatch 294points to an array of 295.Fa nmatch 296structures of type 297.Li regmatch_t . 298Such a structure has at least the members 299.Fa rm_so 300and 301.Fa rm_eo , 302both of type 303.Fa regoff_t 304(a signed arithmetic type at least as large as an 305.Li off_t 306and a 307.Li ssize_t ) , 308containing respectively the offset of the first character of a substring 309and the offset of the first character after the end of the substring. 310Offsets are measured from the beginning of the 311.Fa string 312argument given to 313.Fn regexec . 314An empty substring is denoted by equal offsets, 315both indicating the character following the empty substring. 316.Pp 317The 0th member of the 318.Fa pmatch 319array is filled in to indicate what substring of 320.Fa string 321was matched by the entire RE. 322Remaining members report what substring was matched by parenthesized 323subexpressions within the RE; 324member 325.Va i 326reports subexpression 327.Va i , 328with subexpressions counted (starting at 1) by the order of their opening 329parentheses in the RE, left to right. 330Unused entries in the array\(emcorresponding either to subexpressions that 331did not participate in the match at all, or to subexpressions that do not 332exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both 333.Fa rm_so 334and 335.Fa rm_eo 336set to \-1. 337If a subexpression participated in the match several times, 338the reported substring is the last one it matched. 339(Note, as an example in particular, that when the RE 340.Dq (b*)+ 341matches 342.Dq bbb , 343the parenthesized subexpression matches each of the three 344.Sq b Ns s 345and then 346an infinite number of empty strings following the last 347.Sq b , 348so the reported substring is one of the empties.) 349.Pp 350If 351.Dv REG_STARTEND 352is specified, 353.Fa pmatch 354must point to at least one 355.Li regmatch_t 356(even if 357.Fa nmatch 358is 0 or 359.Dv REG_NOSUB 360was specified), 361to hold the input offsets for 362.Dv REG_STARTEND . 363Use for output is still entirely controlled by 364.Fa nmatch ; 365if 366.Fa nmatch 367is 0 or 368.Dv REG_NOSUB 369was specified, 370the value of 371.Fa pmatch[0] 372will not be changed by a successful 373.Fn regexec . 374.Pp 375.Fn regerror 376maps a non-zero 377.Va errcode 378from either 379.Fn regcomp 380or 381.Fn regexec 382to a human-readable, printable message. 383If 384.Fa preg 385is non-NULL, 386the error code should have arisen from use of 387the 388.Li regex_t 389pointed to by 390.Fa preg , 391and if the error code came from 392.Fn regcomp , 393it should have been the result from the most recent 394.Fn regcomp 395using that 396.Li regex_t . 397.Pf ( Fn regerror 398may be able to supply a more detailed message using information 399from the 400.Li regex_t . ) 401.Fn regerror 402places the NUL-terminated message into the buffer pointed to by 403.Fa errbuf , 404limiting the length (including the NUL) to at most 405.Fa errbuf_size 406bytes. 407If the whole message won't fit, 408as much of it as will fit before the terminating NUL is supplied. 409In any case, 410the returned value is the size of buffer needed to hold the whole 411message (including the terminating NUL). 412If 413.Fa errbuf_size 414is 0, 415.Fa errbuf 416is ignored but the return value is still correct. 417.Pp 418If the 419.Fa errcode 420given to 421.Fn regerror 422is first 423.Tn OR Ns 'ed 424with 425.Dv REG_ITOA , 426the 427.Dq message 428that results is the printable name of the error code, 429e.g., 430.Dq REG_NOMATCH , 431rather than an explanation thereof. 432If 433.Fa errcode 434is 435.Dv REG_ATOI , 436then 437.Fa preg 438shall be non-null and the 439.Fa re_endp 440member of the structure it points to 441must point to the printable name of an error code; 442in this case, the result in 443.Fa errbuf 444is the decimal digits of 445the numeric value of the error code 446(0 if the name is not recognized). 447.Dv REG_ITOA 448and 449.Dv REG_ATOI 450are intended primarily as debugging facilities; 451they are extensions, 452compatible with but not specified by 453.St -p1003.2 454and should be used with 455caution in software intended to be portable to other systems. 456Be warned also that they are considered experimental and changes are possible. 457.Pp 458.Fn regfree 459frees any dynamically allocated storage associated with the compiled RE 460pointed to by 461.Fa preg . 462The remaining 463.Li regex_t 464is no longer a valid compiled RE 465and the effect of supplying it to 466.Fn regexec 467or 468.Fn regerror 469is undefined. 470.Pp 471None of these functions references global variables except for tables 472of constants; 473all are safe for use from multiple threads if the arguments are safe. 474.Sh IMPLEMENTATION CHOICES 475There are a number of decisions that 476.St -p1003.2 477leaves up to the implementor, 478either by explicitly saying 479.Dq undefined 480or by virtue of them being 481forbidden by the RE grammar. 482This implementation treats them as follows. 483.Pp 484See 485.Xr re_format 7 486for a discussion of the definition of case-independent matching. 487.Pp 488There is no particular limit on the length of REs, 489except insofar as memory is limited. 490Memory usage is approximately linear in RE size, and largely insensitive 491to RE complexity, except for bounded repetitions. 492See 493.Sx BUGS 494for one short RE using them 495that will run almost any system out of memory. 496.Pp 497A backslashed character other than one specifically given a magic meaning 498by 499.St -p1003.2 500(such magic meanings occur only in obsolete REs) 501is taken as an ordinary character. 502.Pp 503Any unmatched 504.Ql \&[ 505is a 506.Dv REG_EBRACK 507error. 508.Pp 509Equivalence classes cannot begin or end bracket-expression ranges. 510The endpoint of one range cannot begin another. 511.Pp 512RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. 513.Pp 514A repetition operator (?, *, +, or bounds) cannot follow another 515repetition operator. 516A repetition operator cannot begin an expression or subexpression 517or follow 518.Ql ^ 519or 520.Ql | . 521.Pp 522A 523.Ql | 524cannot appear first or last in a (sub)expression, or after another 525.Ql | , 526i.e., an operand of 527.Ql | 528cannot be an empty subexpression. 529An empty parenthesized subexpression, 530.Ql \&(\&) , 531is legal and matches an 532empty (sub)string. 533An empty string is not a legal RE. 534.Pp 535A 536.Ql { 537followed by a digit is considered the beginning of bounds for a 538bounded repetition, which must then follow the syntax for bounds. 539A 540.Ql { 541.Em not 542followed by a digit is considered an ordinary character. 543.Pp 544.Ql ^ 545and 546.Ql $ 547beginning and ending subexpressions in obsolete 548.Pq Dq basic 549REs are anchors, not ordinary characters. 550.Sh DIAGNOSTICS 551Non-zero error codes from 552.Fn regcomp 553and 554.Fn regexec 555include the following: 556.Pp 557.Bl -tag -compact -width XREG_ECOLLATEX 558.It Er REG_NOMATCH 559regexec() failed to match 560.It Er REG_BADPAT 561invalid regular expression 562.It Er REG_ECOLLATE 563invalid collating element 564.It Er REG_ECTYPE 565invalid character class 566.It Er REG_EESCAPE 567\e applied to unescapable character 568.It Er REG_ESUBREG 569invalid backreference number 570.It Er REG_EBRACK 571brackets [ ] not balanced 572.It Er REG_EPAREN 573parentheses ( ) not balanced 574.It Er REG_EBRACE 575braces { } not balanced 576.It Er REG_BADBR 577invalid repetition count(s) in { } 578.It Er REG_ERANGE 579invalid character range in [ ] 580.It Er REG_ESPACE 581ran out of memory 582.It Er REG_BADRPT 583?, *, or + operand invalid 584.It Er REG_EMPTY 585empty (sub)expression 586.It Er REG_ASSERT 587.Dq can't happen 588\(emyou found a bug 589.It Er REG_INVARG 590invalid argument, e.g., negative-length string 591.El 592.Sh SEE ALSO 593.Xr grep 1 , 594.Xr re_format 7 595.Pp 596.St -p1003.2 , 597sections 2.8 (Regular Expression Notation) 598and 599B.5 (C Binding for Regular Expression Matching). 600.Sh HISTORY 601Originally written by Henry Spencer. 602Altered for inclusion in the 603.Bx 4.4 604distribution. 605.Sh BUGS 606This is an alpha release with known defects. 607Please report problems. 608.Pp 609There is one known functionality bug. 610The implementation of internationalization is incomplete: 611the locale is always assumed to be the default one of 612.St -p1003.2 , 613and only the collating elements etc. of that locale are available. 614.Pp 615The back-reference code is subtle and doubts linger about its correctness 616in complex cases. 617.Pp 618.Fn regexec 619performance is poor. 620This will improve with later releases. 621.Fa nmatch 622exceeding 0 is expensive; 623.Fa nmatch 624exceeding 1 is worse. 625.Fn regexec 626is largely insensitive to RE complexity 627.Em except 628that back references are massively expensive. 629RE length does matter; in particular, there is a strong speed bonus 630for keeping RE length under about 30 characters, 631with most special characters counting roughly double. 632.Pp 633.Fn regcomp 634implements bounded repetitions by macro expansion, 635which is costly in time and space if counts are large 636or bounded repetitions are nested. 637A RE like, say, 638.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100} 639will (eventually) run almost any existing machine out of swap space. 640.Pp 641There are suspected problems with response to obscure error conditions. 642Notably, 643certain kinds of internal overflow, 644produced only by truly enormous REs or by multiply nested bounded repetitions, 645are probably not handled well. 646.Pp 647Due to a mistake in 648.St -p1003.2 , 649things like 650.Ql a)b 651are legal REs because 652.Ql \&) 653is 654a special character only in the presence of a previous unmatched 655.Ql \&( . 656This can't be fixed until the spec is fixed. 657.Pp 658The standard's definition of back references is vague. 659For example, does 660.Dq a\e(\e(b\e)*\e2\e)*d 661match 662.Dq abbbd ? 663Until the standard is clarified, 664behavior in such cases should not be relied on. 665.Pp 666The implementation of word-boundary matching is a bit of a kludge, 667and bugs may lurk in combinations of word-boundary matching and anchoring. 668