1.\" $OpenBSD: regex.3,v 1.26 2015/11/10 23:48:18 jmc Exp $ 2.\" 3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. 4.\" 5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 6.\" Copyright (c) 1992, 1993, 1994 7.\" The Regents of the University of California. All rights reserved. 8.\" 9.\" This code is derived from software contributed to Berkeley by 10.\" Henry Spencer. 11.\" 12.\" Redistribution and use in source and binary forms, with or without 13.\" modification, are permitted provided that the following conditions 14.\" are met: 15.\" 1. Redistributions of source code must retain the above copyright 16.\" notice, this list of conditions and the following disclaimer. 17.\" 2. Redistributions in binary form must reproduce the above copyright 18.\" notice, this list of conditions and the following disclaimer in the 19.\" documentation and/or other materials provided with the distribution. 20.\" 3. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 37.\" 38.Dd $Mdocdate: November 10 2015 $ 39.Dt REGEXEC 3 40.Os 41.Sh NAME 42.Nm regcomp , 43.Nm regexec , 44.Nm regerror , 45.Nm regfree 46.Nd regular expression routines 47.Sh SYNOPSIS 48.In sys/types.h 49.In regex.h 50.Ft int 51.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 52.Pp 53.Ft int 54.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \ 55 "regmatch_t pmatch[]" "int eflags" 56.Pp 57.Ft size_t 58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \ 59 "size_t errbuf_size" 60.Pp 61.Ft void 62.Fn regfree "regex_t *preg" 63.Sh DESCRIPTION 64These routines implement 65.St -p1003.2 66regular expressions 67.Pq Dq REs ; 68see 69.Xr re_format 7 . 70.Fn regcomp 71compiles an RE written as a string into an internal form, 72.Fn regexec 73matches that internal form against a string and reports results, 74.Fn regerror 75transforms error codes from either into human-readable messages, and 76.Fn regfree 77frees any dynamically allocated storage used by the internal form 78of an RE. 79.Pp 80The header 81.In regex.h 82declares two structure types, 83.Li regex_t 84and 85.Li regmatch_t , 86the former for compiled internal forms and the latter for match reporting. 87It also declares the four functions, 88a type 89.Li regoff_t , 90and a number of constants with names starting with 91.Dv REG_ . 92.Pp 93.Fn regcomp 94compiles the regular expression contained in the 95.Fa pattern 96string, 97subject to the flags in 98.Fa cflags , 99and places the results in the 100.Li regex_t 101structure pointed to by 102.Fa preg . 103The 104.Fa cflags 105argument is the bitwise 106.Tn OR 107of zero or more of the following values: 108.Bl -tag -width XREG_EXTENDEDX 109.It Dv REG_EXTENDED 110Compile modern 111.Pq Dq extended 112REs, 113rather than the obsolete 114.Pq Dq basic 115REs that are the default. 116.It Dv REG_BASIC 117This is a synonym for 0, 118provided as a counterpart to 119.Dv REG_EXTENDED 120to improve readability. 121.It Dv REG_NOSPEC 122Compile with recognition of all special characters turned off. 123All characters are thus considered ordinary, 124so the RE is a literal string. 125This is an extension, 126compatible with but not specified by 127.St -p1003.2 , 128and should be used with 129caution in software intended to be portable to other systems. 130.Dv REG_EXTENDED 131and 132.Dv REG_NOSPEC 133may not be used in the same call to 134.Fn regcomp . 135.It Dv REG_ICASE 136Compile for matching that ignores upper/lower case distinctions. 137See 138.Xr re_format 7 . 139.It Dv REG_NOSUB 140Compile for matching that need only report success or failure, 141not what was matched. 142.It Dv REG_NEWLINE 143Compile for newline-sensitive matching. 144By default, newline is a completely ordinary character with no special 145meaning in either REs or strings. 146With this flag, 147.Ql \&[^ 148bracket expressions and 149.Ql \&. 150never match newline, 151a 152.Ql ^ 153anchor matches the null string after any newline in the string 154in addition to its normal function, 155and the 156.Ql $ 157anchor matches the null string before any newline in the 158string in addition to its normal function. 159.It Dv REG_PEND 160The regular expression ends, 161not at the first NUL, 162but just before the character pointed to by the 163.Fa re_endp 164member of the structure pointed to by 165.Fa preg . 166The 167.Fa re_endp 168member is of type 169.Fa const\ char\ * . 170This flag permits inclusion of NULs in the RE; 171they are considered ordinary characters. 172This is an extension, 173compatible with but not specified by 174.St -p1003.2 , 175and should be used with 176caution in software intended to be portable to other systems. 177.El 178.Pp 179When successful, 180.Fn regcomp 181returns 0 and fills in the structure pointed to by 182.Fa preg . 183One member of that structure 184(other than 185.Fa re_endp ) 186is publicized: 187.Fa re_nsub , 188of type 189.Fa size_t , 190contains the number of parenthesized subexpressions within the RE 191(except that the value of this member is undefined if the 192.Dv REG_NOSUB 193flag was used). 194If 195.Fn regcomp 196fails, it returns a non-zero error code; 197see DIAGNOSTICS. 198.Pp 199.Fn regexec 200matches the compiled RE pointed to by 201.Fa preg 202against the 203.Fa string , 204subject to the flags in 205.Fa eflags , 206and reports results using 207.Fa nmatch , 208.Fa pmatch , 209and the returned value. 210The RE must have been compiled by a previous invocation of 211.Fn regcomp . 212The compiled form is not altered during execution of 213.Fn regexec , 214so a single compiled RE can be used simultaneously by multiple threads. 215.Pp 216By default, 217the NUL-terminated string pointed to by 218.Fa string 219is considered to be the text of an entire line, minus any terminating 220newline. 221The 222.Fa eflags 223argument is the bitwise 224.Tn OR 225of zero or more of the following values: 226.Bl -tag -width XREG_STARTENDX 227.It Dv REG_NOTBOL 228The first character of 229the string 230is not the beginning of a line, so the 231.Ql ^ 232anchor should not match before it. 233This does not affect the behavior of newlines under 234.Dv REG_NEWLINE . 235.It Dv REG_NOTEOL 236The NUL terminating 237the string 238does not end a line, so the 239.Ql $ 240anchor should not match before it. 241This does not affect the behavior of newlines under 242.Dv REG_NEWLINE . 243.It Dv REG_STARTEND 244The string is considered to start at 245\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR 246and to have a terminating NUL located at 247\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR 248(there need not actually be a NUL at that location), 249regardless of the value of 250.Fa nmatch . 251See below for the definition of 252.Fa pmatch 253and 254.Fa nmatch . 255This is an extension, 256compatible with but not specified by 257.St -p1003.2 , 258and should be used with 259caution in software intended to be portable to other systems. 260Note that a non-zero \fIrm_so\fR does not imply 261.Dv REG_NOTBOL ; 262.Dv REG_STARTEND 263affects only the location of the string, 264not how it is matched. 265.El 266.Pp 267See 268.Xr re_format 7 269for a discussion of what is matched in situations where an RE or a 270portion thereof could match any of several substrings of 271.Fa string . 272.Pp 273Normally, 274.Fn regexec 275returns 0 for success and the non-zero code 276.Dv REG_NOMATCH 277for failure. 278Other non-zero error codes may be returned in exceptional situations; 279see DIAGNOSTICS. 280.Pp 281If 282.Dv REG_NOSUB 283was specified in the compilation of the RE, 284or if 285.Fa nmatch 286is 0, 287.Fn regexec 288ignores the 289.Fa pmatch 290argument (but see below for the case where 291.Dv REG_STARTEND 292is specified). 293Otherwise, 294.Fa pmatch 295points to an array of 296.Fa nmatch 297structures of type 298.Li regmatch_t . 299Such a structure has at least the members 300.Fa rm_so 301and 302.Fa rm_eo , 303both of type 304.Fa regoff_t 305(a signed arithmetic type at least as large as an 306.Li off_t 307and a 308.Li ssize_t ) , 309containing respectively the offset of the first character of a substring 310and the offset of the first character after the end of the substring. 311Offsets are measured from the beginning of the 312.Fa string 313argument given to 314.Fn regexec . 315An empty substring is denoted by equal offsets, 316both indicating the character following the empty substring. 317.Pp 318The 0th member of the 319.Fa pmatch 320array is filled in to indicate what substring of 321.Fa string 322was matched by the entire RE. 323Remaining members report what substring was matched by parenthesized 324subexpressions within the RE; 325member 326.Va i 327reports subexpression 328.Va i , 329with subexpressions counted (starting at 1) by the order of their opening 330parentheses in the RE, left to right. 331Unused entries in the array\(emcorresponding either to subexpressions that 332did not participate in the match at all, or to subexpressions that do not 333exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both 334.Fa rm_so 335and 336.Fa rm_eo 337set to \-1. 338If a subexpression participated in the match several times, 339the reported substring is the last one it matched. 340(Note, as an example in particular, that when the RE 341.Dq (b*)+ 342matches 343.Dq bbb , 344the parenthesized subexpression matches each of the three 345.Sq b Ns s 346and then 347an infinite number of empty strings following the last 348.Sq b , 349so the reported substring is one of the empties.) 350.Pp 351If 352.Dv REG_STARTEND 353is specified, 354.Fa pmatch 355must point to at least one 356.Li regmatch_t 357(even if 358.Fa nmatch 359is 0 or 360.Dv REG_NOSUB 361was specified), 362to hold the input offsets for 363.Dv REG_STARTEND . 364Use for output is still entirely controlled by 365.Fa nmatch ; 366if 367.Fa nmatch 368is 0 or 369.Dv REG_NOSUB 370was specified, 371the value of 372.Fa pmatch[0] 373will not be changed by a successful 374.Fn regexec . 375.Pp 376.Fn regerror 377maps a non-zero 378.Va errcode 379from either 380.Fn regcomp 381or 382.Fn regexec 383to a human-readable, printable message. 384If 385.Fa preg 386is non-NULL, 387the error code should have arisen from use of 388the 389.Li regex_t 390pointed to by 391.Fa preg , 392and if the error code came from 393.Fn regcomp , 394it should have been the result from the most recent 395.Fn regcomp 396using that 397.Li regex_t . 398.Pf ( Fn regerror 399may be able to supply a more detailed message using information 400from the 401.Li regex_t . ) 402.Fn regerror 403places the NUL-terminated message into the buffer pointed to by 404.Fa errbuf , 405limiting the length (including the NUL) to at most 406.Fa errbuf_size 407bytes. 408If the whole message won't fit, 409as much of it as will fit before the terminating NUL is supplied. 410In any case, 411the returned value is the size of buffer needed to hold the whole 412message (including the terminating NUL). 413If 414.Fa errbuf_size 415is 0, 416.Fa errbuf 417is ignored but the return value is still correct. 418.Pp 419If the 420.Fa errcode 421given to 422.Fn regerror 423is first 424.Tn OR Ns 'ed 425with 426.Dv REG_ITOA , 427the 428.Dq message 429that results is the printable name of the error code, 430e.g., 431.Dq REG_NOMATCH , 432rather than an explanation thereof. 433If 434.Fa errcode 435is 436.Dv REG_ATOI , 437then 438.Fa preg 439shall be non-null and the 440.Fa re_endp 441member of the structure it points to 442must point to the printable name of an error code; 443in this case, the result in 444.Fa errbuf 445is the decimal digits of 446the numeric value of the error code 447(0 if the name is not recognized). 448.Dv REG_ITOA 449and 450.Dv REG_ATOI 451are intended primarily as debugging facilities; 452they are extensions, 453compatible with but not specified by 454.St -p1003.2 455and should be used with 456caution in software intended to be portable to other systems. 457Be warned also that they are considered experimental and changes are possible. 458.Pp 459.Fn regfree 460frees any dynamically allocated storage associated with the compiled RE 461pointed to by 462.Fa preg . 463The remaining 464.Li regex_t 465is no longer a valid compiled RE 466and the effect of supplying it to 467.Fn regexec 468or 469.Fn regerror 470is undefined. 471.Pp 472None of these functions references global variables except for tables 473of constants; 474all are safe for use from multiple threads if the arguments are safe. 475.Sh IMPLEMENTATION CHOICES 476There are a number of decisions that 477.St -p1003.2 478leaves up to the implementor, 479either by explicitly saying 480.Dq undefined 481or by virtue of them being 482forbidden by the RE grammar. 483This implementation treats them as follows. 484.Pp 485See 486.Xr re_format 7 487for a discussion of the definition of case-independent matching. 488.Pp 489There is no particular limit on the length of REs, 490except insofar as memory is limited. 491Memory usage is approximately linear in RE size, and largely insensitive 492to RE complexity, except for bounded repetitions. 493See 494.Sx BUGS 495for one short RE using them 496that will run almost any system out of memory. 497.Pp 498A backslashed character other than one specifically given a magic meaning 499by 500.St -p1003.2 501(such magic meanings occur only in obsolete REs) 502is taken as an ordinary character. 503.Pp 504Any unmatched 505.Ql \&[ 506is a 507.Dv REG_EBRACK 508error. 509.Pp 510Equivalence classes cannot begin or end bracket-expression ranges. 511The endpoint of one range cannot begin another. 512.Pp 513RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. 514.Pp 515A repetition operator (?, *, +, or bounds) cannot follow another 516repetition operator. 517A repetition operator cannot begin an expression or subexpression 518or follow 519.Ql ^ 520or 521.Ql | . 522.Pp 523A 524.Ql | 525cannot appear first or last in a (sub)expression, or after another 526.Ql | , 527i.e., an operand of 528.Ql | 529cannot be an empty subexpression. 530An empty parenthesized subexpression, 531.Ql \&(\&) , 532is legal and matches an 533empty (sub)string. 534An empty string is not a legal RE. 535.Pp 536A 537.Ql { 538followed by a digit is considered the beginning of bounds for a 539bounded repetition, which must then follow the syntax for bounds. 540A 541.Ql { 542.Em not 543followed by a digit is considered an ordinary character. 544.Pp 545.Ql ^ 546and 547.Ql $ 548beginning and ending subexpressions in obsolete 549.Pq Dq basic 550REs are anchors, not ordinary characters. 551.Sh DIAGNOSTICS 552Non-zero error codes from 553.Fn regcomp 554and 555.Fn regexec 556include the following: 557.Pp 558.Bl -tag -compact -width XREG_ECOLLATEX 559.It Er REG_NOMATCH 560regexec() failed to match 561.It Er REG_BADPAT 562invalid regular expression 563.It Er REG_ECOLLATE 564invalid collating element 565.It Er REG_ECTYPE 566invalid character class 567.It Er REG_EESCAPE 568\e applied to unescapable character 569.It Er REG_ESUBREG 570invalid backreference number 571.It Er REG_EBRACK 572brackets [ ] not balanced 573.It Er REG_EPAREN 574parentheses ( ) not balanced 575.It Er REG_EBRACE 576braces { } not balanced 577.It Er REG_BADBR 578invalid repetition count(s) in { } 579.It Er REG_ERANGE 580invalid character range in [ ] 581.It Er REG_ESPACE 582ran out of memory 583.It Er REG_BADRPT 584?, *, or + operand invalid 585.It Er REG_EMPTY 586empty (sub)expression 587.It Er REG_ASSERT 588.Dq can't happen 589\(emyou found a bug 590.It Er REG_INVARG 591invalid argument, e.g., negative-length string 592.El 593.Sh SEE ALSO 594.Xr grep 1 , 595.Xr re_format 7 596.Pp 597.St -p1003.2 , 598sections 2.8 (Regular Expression Notation) 599and 600B.5 (C Binding for Regular Expression Matching). 601.Sh HISTORY 602Predecessors to 603.Fn regcomp 604and 605.Fn regexec , 606.Fn regcmp 607and 608.Fn regex , 609first appeared in the Programmer's Workbench (PWB/UNIX). 610.Pp 611The present functions were originally written by Henry Spencer 612and altered for inclusion in the 613.Bx 4.4 614distribution. 615.Sh BUGS 616This is an alpha release with known defects. 617Please report problems. 618.Pp 619There is one known functionality bug. 620The implementation of internationalization is incomplete: 621the locale is always assumed to be the default one of 622.St -p1003.2 , 623and only the collating elements etc. of that locale are available. 624.Pp 625The back-reference code is subtle and doubts linger about its correctness 626in complex cases. 627.Pp 628.Fn regexec 629performance is poor. 630This will improve with later releases. 631.Fa nmatch 632exceeding 0 is expensive; 633.Fa nmatch 634exceeding 1 is worse. 635.Fn regexec 636is largely insensitive to RE complexity 637.Em except 638that back references are massively expensive. 639RE length does matter; in particular, there is a strong speed bonus 640for keeping RE length under about 30 characters, 641with most special characters counting roughly double. 642.Pp 643.Fn regcomp 644implements bounded repetitions by macro expansion, 645which is costly in time and space if counts are large 646or bounded repetitions are nested. 647A RE like, say, 648.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100} 649will (eventually) run almost any existing machine out of swap space. 650.Pp 651There are suspected problems with response to obscure error conditions. 652Notably, 653certain kinds of internal overflow, 654produced only by truly enormous REs or by multiply nested bounded repetitions, 655are probably not handled well. 656.Pp 657Due to a mistake in 658.St -p1003.2 , 659things like 660.Ql a)b 661are legal REs because 662.Ql \&) 663is 664a special character only in the presence of a previous unmatched 665.Ql \&( . 666This can't be fixed until the spec is fixed. 667.Pp 668The standard's definition of back references is vague. 669For example, does 670.Dq a\e(\e(b\e)*\e2\e)*d 671match 672.Dq abbbd ? 673Until the standard is clarified, 674behavior in such cases should not be relied on. 675.Pp 676The implementation of word-boundary matching is a bit of a kludge, 677and bugs may lurk in combinations of word-boundary matching and anchoring. 678