.\"	$OpenBSD: regex.3,v 1.30 2022/09/11 06:38:10 jmc Exp $
.\"
.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\"
.Dd $Mdocdate: September 11 2022 $
.Dt REGEXEC 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular expression routines
.Sh SYNOPSIS
.In sys/types.h
.In regex.h
.Ft int
.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
.Pp
.Ft int
.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \
    "regmatch_t pmatch[]" "int eflags"
.Pp
.Ft size_t
.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \
    "size_t errbuf_size"
.Pp
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Dq REs ;
see
.Xr re_format 7 .
.Fn regcomp
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages, and
.Fn regfree
frees any dynamically allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Vt regex_t
and
.Vt regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Vt regoff_t ,
and a number of constants with names starting with
.Dv REG_ .
.Pp
.Fn regcomp
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Vt regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument is the bitwise OR of zero or more of the following values:
.Bl -tag -width XREG_EXTENDEDX
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the RE is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql \&[^
bracket expressions and
.Ql \&.
never match newline,
a
.Ql ^
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Fa re_endp
member of the structure pointed to by
.Fa preg .
The
.Fa re_endp
member is of type
.Fa const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Fa re_endp )
is publicized:
.Fa re_nsub ,
of type
.Fa size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
.Pp
.Fn regexec
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following values:
.Bl -tag -width XREG_STARTENDX
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^ ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Vt regmatch_t .
Such a structure has at least the members
.Fa rm_so
and
.Fa rm_eo ,
both of type
.Fa regoff_t
(a signed arithmetic type at least as large as an
.Vt off_t
and a
.Vt ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.Fa rm_so
and
.Fa rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Dq (b*)+
matches
.Dq bbb ,
the parenthesized subexpression matches each of the three
.Sq b Ns s
and then
an infinite number of empty strings following the last
.Sq b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Vt regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch[0]
will not be changed by a successful
.Fn regexec .
.Pp
.Fn regerror
maps a non-zero
.Va errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is non-NULL,
the error code should have arisen from use of
the
.Vt regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Vt regex_t .
.Pf ( Fn regerror
may be able to supply a more detailed message using information
from the
.Vt regex_t . )
.Fn regerror
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including the terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first OR'ed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.,
.Dq REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be non-null and the
.Fa re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
.Fn regfree
frees any dynamically allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Vt regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql \&[
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
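.Pp
The compile-time errors described in this section can be observed with a
small wrapper around
.Fn regcomp
and
.Fn regerror .
The following is a minimal sketch, not part of the library:
.Fn try_compile
is a hypothetical helper name, and the exact error code and message text
reported for a given malformed RE may differ on other implementations.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern as an extended RE.
 * On failure, store the regerror() message in buf.
 * Returns the regcomp() result (0 on success).
 */
static int
try_compile(const char *pattern, char *buf, size_t buflen)
{
	regex_t re;
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
	if (rc != 0) {
		/* The message is truncated to fit and NUL-terminated. */
		regerror(rc, &re, buf, buflen);
		return rc;
	}
	/* Success: leave an empty message and release the RE. */
	if (buflen > 0)
		buf[0] = 0;
	regfree(&re);
	return 0;
}
```
.Ed
.Pp
For example, calling
.Fn try_compile
with the pattern
.Dq a[
is expected to yield a non-zero code and a non-empty message in
.Fa buf .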
.Pp
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^
or
.Ql | .
.Pp
A
.Ql |
cannot appear first or last in a (sub)expression, or after another
.Ql | ,
i.e., an operand of
.Ql |
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql \&(\&) ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^
and
.Ql $
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -compact -width XREG_ECOLLATEX
.It Er REG_NOMATCH
.Fn regexec
failed to match
.It Er REG_BADPAT
invalid regular expression
.It Er REG_ECOLLATE
invalid collating element
.It Er REG_ECTYPE
invalid character class
.It Er REG_EESCAPE
\e applied to unescapable character
.It Er REG_ESUBREG
invalid backreference number
.It Er REG_EBRACK
brackets [ ] not balanced
.It Er REG_EPAREN
parentheses ( ) not balanced
.It Er REG_EBRACE
braces { } not balanced
.It Er REG_BADBR
invalid repetition count(s) in { }
.It Er REG_ERANGE
invalid character range in [ ]
.It Er REG_ESPACE
ran out of memory
.It Er REG_BADRPT
?, *, or + operand invalid
.It Er REG_EMPTY
empty (sub)expression
.It Er REG_ASSERT
.Dq can't happen
\(emyou found a bug
.It Er REG_INVARG
invalid argument, e.g., negative-length string
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Predecessors called
.Fn regcmp
and
.Fn regex
first appeared in PWB/UNIX 1.0.
.Pp
Predecessors
.Fn re_comp
and
.Fn re_exec
first appeared in
.Bx 4.0 ,
became part of
.In unistd.h
in
.Bx 4.4 ,
and were deleted after
.Ox 5.4 .
.Pp
Functions called
.Fn regcomp ,
.Fn regexec ,
.Fn regerror ,
and
.Fn regsub
first appeared in Version\~8
.At ,
were reimplemented and declared in
.In regexp.h
for
.Bx 4.3 Tahoe ,
and were also deleted after
.Ox 5.4 .
.Pp
Taking different arguments, the POSIX
.In regex.h
functions
.Fn regcomp ,
.Fn regexec ,
.Fn regerror ,
and
.Fn regfree
appeared in
.Bx 4.4 .
.Sh AUTHORS
.An -nosplit
The
Version\~8
.At
code was implemented by
.An Rob Pike
and extracted into a library by
.An Dave Presotto .
The
.Bx 4.3 Tahoe
and
.Bx 4.4
versions were both written by
.An Henry Spencer .
.Sh BUGS
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of
.St -p1003.2 ,
and only the collating elements etc. of that locale are available.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
.Fn regexec
performance is poor.
This will improve with later releases.
.Fa nmatch
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
.Fn regexec
is largely insensitive to RE complexity
.Em except
that back references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
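.Pp
Given the cost of match reporting noted above, a caller that needs only a
yes-or-no answer can compile with
.Dv REG_NOSUB
and pass an
.Fa nmatch
of 0.
The following is a minimal sketch of that usage;
.Fn re_matches
is a hypothetical helper name, and no measurement of the actual saving is
claimed here.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether the extended RE pattern
 * matches anywhere in string.
 * Returns 1 on a match, 0 on no match, and -1 on any error.
 */
static int
re_matches(const char *pattern, const char *string)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is wanted. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	/* With nmatch 0 the pmatch argument is ignored. */
	rc = regexec(&re, string, 0, NULL, 0);
	regfree(&re);
	if (rc == 0)
		return 1;
	return rc == REG_NOMATCH ? 0 : -1;
}
```
.Ed
.Pp
A caller that performs many matches with the same RE would normally keep the
compiled
.Vt regex_t
rather than recompiling on every call as this sketch does.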
.Pp
.Fn regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
A RE like, say,
.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql a)b
are legal REs because
.Ql \&)
is
a special character only in the presence of a previous unmatched
.Ql \&( .
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Dq a\e(\e(b\e)*\e2\e)*d
match
.Dq abbbd ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
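.Sh EXAMPLES
The following sketch ties together
.Fn regcomp ,
.Fn regexec ,
and
.Fn regfree
to extract the text matched by one parenthesized subexpression, using the
.Fa rm_so
and
.Fa rm_eo
offsets as described in DESCRIPTION.
It is an illustration only:
.Fn re_capture
is a hypothetical helper name, and the fixed array of 10
.Vt regmatch_t
entries is an arbitrary choice made for brevity.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: match the extended RE pattern against
 * string and copy the text of subexpression idx (0 is the whole
 * match) into out.
 * Returns 0 on success; -1 if the RE does not compile, does not
 * match, the subexpression was unused, or out is too small.
 */
static int
re_capture(const char *pattern, const char *string, size_t idx,
    char *out, size_t outlen)
{
	regmatch_t pm[10];
	regex_t re;
	size_t len, npm = sizeof(pm) / sizeof(pm[0]);
	int rc = -1;

	if (idx >= npm)
		return -1;
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	/* Unused entries have rm_so and rm_eo set to -1. */
	if (regexec(&re, string, npm, pm, 0) == 0 &&
	    pm[idx].rm_so != -1) {
		/* Offsets are measured from the start of string. */
		len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
		if (len < outlen) {
			memcpy(out, string + pm[idx].rm_so, len);
			out[len] = 0;
			rc = 0;
		}
	}
	regfree(&re);
	return rc;
}
```
.Ed
.Pp
With the pattern
.Dq ([a-z]+)=([0-9]+)
and the string
.Dq width=42 ,
subexpression 1 is expected to yield
.Dq width
and subexpression 2
.Dq 42 .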