1.\" $OpenBSD: regex.3,v 1.15 2000/08/09 15:56:45 aaron Exp $ 2.\" 3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. 4.\" 5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 6.\" Copyright (c) 1992, 1993, 1994 7.\" The Regents of the University of California. All rights reserved. 8.\" 9.\" This code is derived from software contributed to Berkeley by 10.\" Henry Spencer. 11.\" 12.\" Redistribution and use in source and binary forms, with or without 13.\" modification, are permitted provided that the following conditions 14.\" are met: 15.\" 1. Redistributions of source code must retain the above copyright 16.\" notice, this list of conditions and the following disclaimer. 17.\" 2. Redistributions in binary form must reproduce the above copyright 18.\" notice, this list of conditions and the following disclaimer in the 19.\" documentation and/or other materials provided with the distribution. 20.\" 3. All advertising materials mentioning features or use of this software 21.\" must display the following acknowledgement: 22.\" This product includes software developed by the University of 23.\" California, Berkeley and its contributors. 24.\" 4. Neither the name of the University nor the names of its contributors 25.\" may be used to endorse or promote products derived from this software 26.\" without specific prior written permission. 27.\" 28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 31.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 38.\" SUCH DAMAGE. 39.\" 40.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 41.\" 42.Dd March 20, 1994 43.Dt REGEX 3 44.Os 45.Sh NAME 46.Nm regcomp , 47.Nm regexec , 48.Nm regerror , 49.Nm regfree 50.Nd regular-expression library 51.Sh SYNOPSIS 52.Fd #include <sys/types.h> 53.Fd #include <regex.h> 54.Fo "int regcomp" 55.Fa "regex_t *preg" 56.Fa "const char *pattern" 57.Fa "int cflags" 58.Fc 59.Fo "int regexec" 60.Fa "const regex_t *preg" 61.Fa "const char *string" 62.Fa "size_t nmatch" 63.Fa "regmatch_t pmatch[]" 64.Fa "int eflags" 65.Fc 66.Fo "size_t regerror" 67.Fa "int errcode" 68.Fa "const regex_t *preg" 69.Fa "char *errbuf" 70.Fa "size_t errbuf_size" 71.Fc 72.Fo "void regfree" 73.Fa "regex_t *preg" 74.Fc 75.Sh DESCRIPTION 76These routines implement POSIX 1003.2 regular expressions 77.Pq Dq REs ; 78see 79.Xr re_format 7 . 80.Fn regcomp 81compiles an RE written as a string into an internal form, 82.Fn regexec 83matches that internal form against a string and reports results, 84.Fn regerror 85transforms error codes from either into human-readable messages, and 86.Fn regfree 87frees any dynamically allocated storage used by the internal form 88of an RE. 89.Pp 90The header 91.Aq Pa regex.h 92declares two structure types, 93.Li regex_t 94and 95.Li regmatch_t , 96the former for compiled internal forms and the latter for match reporting. 97It also declares the four functions, 98a type 99.Li regoff_t , 100and a number of constants with names starting with 101.Dv REG_ . 102.Pp 103.Fn regcomp 104compiles the regular expression contained in the 105.Fa pattern 106string, 107subject to the flags in 108.Fa cflags , 109and places the results in the 110.Li regex_t 111structure pointed to by 112.Fa preg . 113.Fa cflags 114is the bitwise 115.Tn OR 116of zero or more of the following flags: 117.Bl -tag -width XREG_EXTENDEDX 118.It Dv REG_EXTENDED 119Compile modern 120.Pq Dq extended 121REs, 122rather than the obsolete 123.Pq Dq basic 124REs that are the default. 125.It Dv REG_BASIC 126This is a synonym for 0, 127provided as a counterpart to 128.Dv REG_EXTENDED 129to improve readability. 130.It Dv REG_NOSPEC 131Compile with recognition of all special characters turned off. 132All characters are thus considered ordinary, 133so the RE is a literal string. 134This is an extension, 135compatible with but not specified by POSIX 1003.2, 136and should be used with 137caution in software intended to be portable to other systems. 138.Dv REG_EXTENDED 139and 140.Dv REG_NOSPEC 141may not be used in the same call to 142.Fn regcomp . 143.It Dv REG_ICASE 144Compile for matching that ignores upper/lower case distinctions. 145See 146.Xr re_format 7 . 147.It Dv REG_NOSUB 148Compile for matching that need only report success or failure, 149not what was matched. 150.It Dv REG_NEWLINE 151Compile for newline-sensitive matching. 152By default, newline is a completely ordinary character with no special 153meaning in either REs or strings. 154With this flag, 155.Ql \&[^ 156bracket expressions and 157.Ql \&. 158never match newline, 159a 160.Ql ^ 161anchor matches the null string after any newline in the string 162in addition to its normal function, 163and the 164.Ql $ 165anchor matches the null string before any newline in the 166string in addition to its normal function. 167.It Dv REG_PEND 168The regular expression ends, 169not at the first NUL, 170but just before the character pointed to by the 171.Fa re_endp 172member of the structure pointed to by 173.Fa preg . 174The 175.Fa re_endp 176member is of type 177.Fa const\ char\ * . 178This flag permits inclusion of NULs in the RE; 179they are considered ordinary characters. 180This is an extension, 181compatible with but not specified by POSIX 1003.2, 182and should be used with 183caution in software intended to be portable to other systems. 184.El 185.Pp 186When successful, 187.Fn regcomp 188returns 0 and fills in the structure pointed to by 189.Fa preg . 190One member of that structure 191(other than 192.Fa re_endp ) 193is publicized: 194.Fa re_nsub , 195of type 196.Fa size_t , 197contains the number of parenthesized subexpressions within the RE 198(except that the value of this member is undefined if the 199.Dv REG_NOSUB 200flag was used). 201If 202.Fn regcomp 203fails, it returns a non-zero error code; 204see DIAGNOSTICS. 205.Pp 206.Fn regexec 207matches the compiled RE pointed to by 208.Fa preg 209against the 210.Fa string , 211subject to the flags in 212.Fa eflags , 213and reports results using 214.Fa nmatch , 215.Fa pmatch , 216and the returned value. 217The RE must have been compiled by a previous invocation of 218.Fn regcomp . 219The compiled form is not altered during execution of 220.Fn regexec , 221so a single compiled RE can be used simultaneously by multiple threads. 222.Pp 223By default, 224the null-terminated string pointed to by 225.Fa string 226is considered to be the text of an entire line, minus any terminating 227newline. 228The 229.Fa eflags 230argument is the bitwise 231.Tn OR 232of zero or more of the following flags: 233.Bl -tag -width XREG_STARTENDX 234.It Dv REG_NOTBOL 235The first character of 236the string 237is not the beginning of a line, so the 238.Ql ^ 239anchor should not match before it. 240This does not affect the behavior of newlines under 241.Dv REG_NEWLINE . 242.It Dv REG_NOTEOL 243The NUL terminating 244the string 245does not end a line, so the 246.Ql $ 247anchor should not match before it. 248This does not affect the behavior of newlines under 249.Dv REG_NEWLINE . 250.It Dv REG_STARTEND 251The string is considered to start at 252\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR 253and to have a terminating NUL located at 254\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR 255(there need not actually be a NUL at that location), 256regardless of the value of 257.Fa nmatch . 258See below for the definition of 259.Fa pmatch 260and 261.Fa nmatch . 262This is an extension, 263compatible with but not specified by POSIX 1003.2, 264and should be used with 265caution in software intended to be portable to other systems. 266Note that a non-zero \fIrm_so\fR does not imply 267.Dv REG_NOTBOL ; 268.Dv REG_STARTEND 269affects only the location of the string, 270not how it is matched. 271.El 272.Pp 273See 274.Xr re_format 7 275for a discussion of what is matched in situations where an RE or a 276portion thereof could match any of several substrings of 277.Fa string . 278.Pp 279Normally, 280.Fn regexec 281returns 0 for success and the non-zero code 282.Dv REG_NOMATCH 283for failure. 284Other non-zero error codes may be returned in exceptional situations; 285see DIAGNOSTICS. 286.Pp 287If 288.Dv REG_NOSUB 289was specified in the compilation of the RE, 290or if 291.Fa nmatch 292is 0, 293.Fn regexec 294ignores the 295.Fa pmatch 296argument (but see below for the case where 297.Dv REG_STARTEND 298is specified). 299Otherwise, 300.Fa pmatch 301points to an array of 302.Fa nmatch 303structures of type 304.Li regmatch_t . 305Such a structure has at least the members 306.Fa rm_so 307and 308.Fa rm_eo , 309both of type 310.Fa regoff_t 311(a signed arithmetic type at least as large as an 312.Li off_t 313and a 314.Li ssize_t ) , 315containing respectively the offset of the first character of a substring 316and the offset of the first character after the end of the substring. 317Offsets are measured from the beginning of the 318.Fa string 319argument given to 320.Fn regexec . 321An empty substring is denoted by equal offsets, 322both indicating the character following the empty substring. 323.Pp 324The 0th member of the 325.Fa pmatch 326array is filled in to indicate what substring of 327.Fa string 328was matched by the entire RE. 329Remaining members report what substring was matched by parenthesized 330subexpressions within the RE; 331member 332.Va i 333reports subexpression 334.Va i , 335with subexpressions counted (starting at 1) by the order of their opening 336parentheses in the RE, left to right. 337Unused entries in the array\(emcorresponding either to subexpressions that 338did not participate in the match at all, or to subexpressions that do not 339exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both 340.Fa rm_so 341and 342.Fa rm_eo 343set to \-1. 344If a subexpression participated in the match several times, 345the reported substring is the last one it matched. 346(Note, as an example in particular, that when the RE 347.Dq (b*)+ 348matches 349.Dq bbb , 350the parenthesized subexpression matches each of the three 351.Sq b Ns s 352and then 353an infinite number of empty strings following the last 354.Sq b , 355so the reported substring is one of the empties.) 356.Pp 357If 358.Dv REG_STARTEND 359is specified, 360.Fa pmatch 361must point to at least one 362.Li regmatch_t 363(even if 364.Fa nmatch 365is 0 or 366.Dv REG_NOSUB 367was specified), 368to hold the input offsets for 369.Dv REG_STARTEND . 370Use for output is still entirely controlled by 371.Fa nmatch ; 372if 373.Fa nmatch 374is 0 or 375.Dv REG_NOSUB 376was specified, 377the value of 378.Fa pmatch[0] 379will not be changed by a successful 380.Fn regexec . 381.Pp 382.Fn regerror 383maps a non-zero 384.Va errcode 385from either 386.Fn regcomp 387or 388.Fn regexec 389to a human-readable, printable message. 390If 391.Fa preg 392is non-NULL, 393the error code should have arisen from use of 394the 395.Li regex_t 396pointed to by 397.Fa preg , 398and if the error code came from 399.Fn regcomp , 400it should have been the result from the most recent 401.Fn regcomp 402using that 403.Li regex_t . 404.Pf ( Fn regerror 405may be able to supply a more detailed message using information 406from the 407.Li regex_t . ) 408.Fn regerror 409places the null-terminated message into the buffer pointed to by 410.Fa errbuf , 411limiting the length (including the NUL) to at most 412.Fa errbuf_size 413bytes. 414If the whole message won't fit, 415as much of it as will fit before the terminating NUL is supplied. 416In any case, 417the returned value is the size of buffer needed to hold the whole 418message (including the terminating NUL). 419If 420.Fa errbuf_size 421is 0, 422.Fa errbuf 423is ignored but the return value is still correct. 424.Pp 425If the 426.Fa errcode 427given to 428.Fn regerror 429is first 430.Tn OR Ns 'ed 431with 432.Dv REG_ITOA , 433the 434.Dq message 435that results is the printable name of the error code, 436e.g., 437.Dq REG_NOMATCH , 438rather than an explanation thereof. 439If 440.Fa errcode 441is 442.Dv REG_ATOI , 443then 444.Fa preg 445shall be non-null and the 446.Fa re_endp 447member of the structure it points to 448must point to the printable name of an error code; 449in this case, the result in 450.Fa errbuf 451is the decimal digits of 452the numeric value of the error code 453(0 if the name is not recognized). 454.Dv REG_ITOA 455and 456.Dv REG_ATOI 457are intended primarily as debugging facilities; 458they are extensions, 459compatible with but not specified by POSIX 1003.2, 460and should be used with 461caution in software intended to be portable to other systems. 462Be warned also that they are considered experimental and changes are possible. 463.Pp 464.Fn regfree 465frees any dynamically allocated storage associated with the compiled RE 466pointed to by 467.Fa preg . 468The remaining 469.Li regex_t 470is no longer a valid compiled RE 471and the effect of supplying it to 472.Fn regexec 473or 474.Fn regerror 475is undefined. 476.Pp 477None of these functions references global variables except for tables 478of constants; 479all are safe for use from multiple threads if the arguments are safe. 480.Sh IMPLEMENTATION CHOICES 481There are a number of decisions that 1003.2 leaves up to the implementor, 482either by explicitly saying 483.Dq undefined 484or by virtue of them being 485forbidden by the RE grammar. 486This implementation treats them as follows. 487.Pp 488See 489.Xr re_format 7 490for a discussion of the definition of case-independent matching. 491.Pp 492There is no particular limit on the length of REs, 493except insofar as memory is limited. 494Memory usage is approximately linear in RE size, and largely insensitive 495to RE complexity, except for bounded repetitions. 496See 497.Sx BUGS 498for one short RE using them 499that will run almost any system out of memory. 500.Pp 501A backslashed character other than one specifically given a magic meaning 502by 1003.2 (such magic meanings occur only in obsolete REs) 503is taken as an ordinary character. 504.Pp 505Any unmatched 506.Ql \&[ 507is a 508.Dv REG_EBRACK 509error. 510.Pp 511Equivalence classes cannot begin or end bracket-expression ranges. 512The endpoint of one range cannot begin another. 513.Pp 514RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. 515.Pp 516A repetition operator (?, *, +, or bounds) cannot follow another 517repetition operator. 518A repetition operator cannot begin an expression or subexpression 519or follow 520.Ql ^ 521or 522.Ql | . 523.Pp 524A 525.Ql | 526cannot appear first or last in a (sub)expression, or after another 527.Ql | , 528i.e., an operand of 529.Ql | 530cannot be an empty subexpression. 531An empty parenthesized subexpression, 532.Ql \&(\&) , 533is legal and matches an 534empty (sub)string. 535An empty string is not a legal RE. 536.Pp 537A 538.Ql { 539followed by a digit is considered the beginning of bounds for a 540bounded repetition, which must then follow the syntax for bounds. 541A 542.Ql { 543.Em not 544followed by a digit is considered an ordinary character. 545.Pp 546.Ql ^ 547and 548.Ql $ 549beginning and ending subexpressions in obsolete 550.Pq Dq basic 551REs are anchors, not ordinary characters. 552.Sh SEE ALSO 553.Xr grep 1 , 554.Xr re_format 7 555.Pp 556POSIX 1003.2, sections 2.8 (Regular Expression Notation) 557and 558B.5 (C Binding for Regular Expression Matching). 559.Sh DIAGNOSTICS 560Non-zero error codes from 561.Fn regcomp 562and 563.Fn regexec 564include the following: 565.Pp 566.Bl -tag -compact -width XREG_ECOLLATEX 567.It Er REG_NOMATCH 568regexec() failed to match 569.It Er REG_BADPAT 570invalid regular expression 571.It Er REG_ECOLLATE 572invalid collating element 573.It Er REG_ECTYPE 574invalid character class 575.It Er REG_EESCAPE 576\e applied to unescapable character 577.It Er REG_ESUBREG 578invalid backreference number 579.It Er REG_EBRACK 580brackets [ ] not balanced 581.It Er REG_EPAREN 582parentheses ( ) not balanced 583.It Er REG_EBRACE 584braces { } not balanced 585.It Er REG_BADBR 586invalid repetition count(s) in { } 587.It Er REG_ERANGE 588invalid character range in [ ] 589.It Er REG_ESPACE 590ran out of memory 591.It Er REG_BADRPT 592?, *, or + operand invalid 593.It Er REG_EMPTY 594empty (sub)expression 595.It Er REG_ASSERT 596.Dq can't happen 597\(emyou found a bug 598.It Er REG_INVARG 599invalid argument, e.g., negative-length string 600.El 601.Sh HISTORY 602Originally written by Henry Spencer. 603Altered for inclusion in the 604.Bx 4.4 605distribution. 606.Sh BUGS 607This is an alpha release with known defects. 608Please report problems. 609.Pp 610There is one known functionality bug. 611The implementation of internationalization is incomplete: 612the locale is always assumed to be the default one of 1003.2, 613and only the collating elements etc. of that locale are available. 614.Pp 615The back-reference code is subtle and doubts linger about its correctness 616in complex cases. 617.Pp 618.Fn regexec 619performance is poor. 620This will improve with later releases. 621.Fa nmatch 622exceeding 0 is expensive; 623.Fa nmatch 624exceeding 1 is worse. 625.Fn regexec 626is largely insensitive to RE complexity 627.Em except 628that back references are massively expensive. 629RE length does matter; in particular, there is a strong speed bonus 630for keeping RE length under about 30 characters, 631with most special characters counting roughly double. 632.Pp 633.Fn regcomp 634implements bounded repetitions by macro expansion, 635which is costly in time and space if counts are large 636or bounded repetitions are nested. 637A RE like, say, 638.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100} 639will (eventually) run almost any existing machine out of swap space. 640.Pp 641There are suspected problems with response to obscure error conditions. 642Notably, 643certain kinds of internal overflow, 644produced only by truly enormous REs or by multiply nested bounded repetitions, 645are probably not handled well. 646.Pp 647Due to a mistake in 1003.2, things like 648.Ql a)b 649are legal REs because 650.Ql \&) 651is 652a special character only in the presence of a previous unmatched 653.Ql \&( . 654This can't be fixed until the spec is fixed. 655.Pp 656The standard's definition of back references is vague. 657For example, does 658.Dq a\e(\e(b\e)*\e2\e)*d 659match 660.Dq abbbd ? 661Until the standard is clarified, 662behavior in such cases should not be relied on. 663.Pp 664The implementation of word-boundary matching is a bit of a kludge, 665and bugs may lurk in combinations of word-boundary matching and anchoring. 666