1.\" $OpenBSD: regex.3,v 1.11 1999/07/09 13:35:22 aaron Exp $ 2.\" 3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. 4.\" 5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 6.\" Copyright (c) 1992, 1993, 1994 7.\" The Regents of the University of California. All rights reserved. 8.\" 9.\" This code is derived from software contributed to Berkeley by 10.\" Henry Spencer. 11.\" 12.\" Redistribution and use in source and binary forms, with or without 13.\" modification, are permitted provided that the following conditions 14.\" are met: 15.\" 1. Redistributions of source code must retain the above copyright 16.\" notice, this list of conditions and the following disclaimer. 17.\" 2. Redistributions in binary form must reproduce the above copyright 18.\" notice, this list of conditions and the following disclaimer in the 19.\" documentation and/or other materials provided with the distribution. 20.\" 3. All advertising materials mentioning features or use of this software 21.\" must display the following acknowledgement: 22.\" This product includes software developed by the University of 23.\" California, Berkeley and its contributors. 24.\" 4. Neither the name of the University nor the names of its contributors 25.\" may be used to endorse or promote products derived from this software 26.\" without specific prior written permission. 27.\" 28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 31.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 38.\" SUCH DAMAGE. 39.\" 40.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 41.\" 42.Dd March 20, 1994 43.Dt REGEX 3 44.Os 45.Sh NAME 46.Nm regcomp, regexec, regerror, regfree 47.Nd regular-expression library 48.Sh SYNOPSIS 49.Fd #include <sys/types.h> 50.Fd #include <regex.h> 51.Fo "int regcomp" 52.Fa "regex_t *preg" 53.Fa "const char *pattern" 54.Fa "int cflags" 55.Fc 56.Fo "int regexec" 57.Fa "const regex_t *preg" 58.Fa "const char *string" 59.Fa "size_t nmatch" 60.Fa "regmatch_t pmatch[]" 61.Fa "int eflags" 62.Fc 63.Fo "size_t regerror" 64.Fa "int errcode" 65.Fa "const regex_t *preg" 66.Fa "char *errbuf" 67.Fa "size_t errbuf_size" 68.Fc 69.Fo "void regfree" 70.Fa "regex_t *preg" 71.Fc 72.Sh DESCRIPTION 73These routines implement POSIX 1003.2 regular expressions 74.Pq Dq REs ; 75see 76.Xr re_format 7 . 77.Fn regcomp 78compiles an RE written as a string into an internal form, 79.Fn regexec 80matches that internal form against a string and reports results, 81.Fn regerror 82transforms error codes from either into human-readable messages, and 83.Fn regfree 84frees any dynamically allocated storage used by the internal form 85of an RE. 86.Pp 87The header 88.Aq Pa regex.h 89declares two structure types, 90.Li regex_t 91and 92.Li regmatch_t , 93the former for compiled internal forms and the latter for match reporting. 94It also declares the four functions, 95a type 96.Li regoff_t , 97and a number of constants with names starting with 98.Dv REG_ . 99.Pp 100.Fn regcomp 101compiles the regular expression contained in the 102.Fa pattern 103string, 104subject to the flags in 105.Fa cflags , 106and places the results in the 107.Li regex_t 108structure pointed to by 109.Fa preg . 110.Fa cflags 111is the bitwise 112.Tn OR 113of zero or more of the following flags: 114.Bl -tag -width XREG_EXTENDEDX 115.It Dv REG_EXTENDED 116Compile modern 117.Pq Dq extended 118REs, 119rather than the obsolete 120.Pq Dq basic 121REs that are the default. 122.It Dv REG_BASIC 123This is a synonym for 0, 124provided as a counterpart to 125.Dv REG_EXTENDED 126to improve readability. 127.It Dv REG_NOSPEC 128Compile with recognition of all special characters turned off. 129All characters are thus considered ordinary, 130so the RE is a literal string. 131This is an extension, 132compatible with but not specified by POSIX 1003.2, 133and should be used with 134caution in software intended to be portable to other systems. 135.Dv REG_EXTENDED 136and 137.Dv REG_NOSPEC 138may not be used in the same call to 139.Fn regcomp . 140.It Dv REG_ICASE 141Compile for matching that ignores upper/lower case distinctions. 142See 143.Xr re_format 7 . 144.It Dv REG_NOSUB 145Compile for matching that need only report success or failure, 146not what was matched. 147.It Dv REG_NEWLINE 148Compile for newline-sensitive matching. 149By default, newline is a completely ordinary character with no special 150meaning in either REs or strings. 151With this flag, 152.Ql \&[^ 153bracket expressions and 154.Ql \&. 155never match newline, 156a 157.Ql ^ 158anchor matches the null string after any newline in the string 159in addition to its normal function, 160and the 161.Ql $ 162anchor matches the null string before any newline in the 163string in addition to its normal function. 164.It Dv REG_PEND 165The regular expression ends, 166not at the first NUL, 167but just before the character pointed to by the 168.Fa re_endp 169member of the structure pointed to by 170.Fa preg . 171The 172.Fa re_endp 173member is of type 174.Fa const\ char\ * . 175This flag permits inclusion of NULs in the RE; 176they are considered ordinary characters. 177This is an extension, 178compatible with but not specified by POSIX 1003.2, 179and should be used with 180caution in software intended to be portable to other systems. 181.El 182.Pp 183When successful, 184.Fn regcomp 185returns 0 and fills in the structure pointed to by 186.Fa preg . 187One member of that structure 188(other than 189.Fa re_endp ) 190is publicized: 191.Fa re_nsub , 192of type 193.Fa size_t , 194contains the number of parenthesized subexpressions within the RE 195(except that the value of this member is undefined if the 196.Dv REG_NOSUB 197flag was used). 198If 199.Fn regcomp 200fails, it returns a non-zero error code; 201see DIAGNOSTICS. 202.Pp 203.Fn regexec 204matches the compiled RE pointed to by 205.Fa preg 206against the 207.Fa string , 208subject to the flags in 209.Fa eflags , 210and reports results using 211.Fa nmatch , 212.Fa pmatch , 213and the returned value. 214The RE must have been compiled by a previous invocation of 215.Fn regcomp . 216The compiled form is not altered during execution of 217.Fn regexec , 218so a single compiled RE can be used simultaneously by multiple threads. 219.Pp 220By default, 221the null-terminated string pointed to by 222.Fa string 223is considered to be the text of an entire line, minus any terminating 224newline. 225The 226.Fa eflags 227argument is the bitwise 228.Tn OR 229of zero or more of the following flags: 230.Bl -tag -width XREG_STARTENDX 231.It Dv REG_NOTBOL 232The first character of 233the string 234is not the beginning of a line, so the 235.Ql ^ 236anchor should not match before it. 237This does not affect the behavior of newlines under 238.Dv REG_NEWLINE . 239.It Dv REG_NOTEOL 240The NUL terminating 241the string 242does not end a line, so the 243.Ql $ 244anchor should not match before it. 245This does not affect the behavior of newlines under 246.Dv REG_NEWLINE . 247.It Dv REG_STARTEND 248The string is considered to start at 249\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR 250and to have a terminating NUL located at 251\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR 252(there need not actually be a NUL at that location), 253regardless of the value of 254.Fa nmatch . 255See below for the definition of 256.Fa pmatch 257and 258.Fa nmatch . 259This is an extension, 260compatible with but not specified by POSIX 1003.2, 261and should be used with 262caution in software intended to be portable to other systems. 263Note that a non-zero \fIrm_so\fR does not imply 264.Dv REG_NOTBOL ; 265.Dv REG_STARTEND 266affects only the location of the string, 267not how it is matched. 268.El 269.Pp 270See 271.Xr re_format 7 272for a discussion of what is matched in situations where an RE or a 273portion thereof could match any of several substrings of 274.Fa string . 275.Pp 276Normally, 277.Fn regexec 278returns 0 for success and the non-zero code 279.Dv REG_NOMATCH 280for failure. 281Other non-zero error codes may be returned in exceptional situations; 282see DIAGNOSTICS. 283.Pp 284If 285.Dv REG_NOSUB 286was specified in the compilation of the RE, 287or if 288.Fa nmatch 289is 0, 290.Fn regexec 291ignores the 292.Fa pmatch 293argument (but see below for the case where 294.Dv REG_STARTEND 295is specified). 296Otherwise, 297.Fa pmatch 298points to an array of 299.Fa nmatch 300structures of type 301.Li regmatch_t . 302Such a structure has at least the members 303.Fa rm_so 304and 305.Fa rm_eo , 306both of type 307.Fa regoff_t 308(a signed arithmetic type at least as large as an 309.Li off_t 310and a 311.Li ssize_t ), 312containing respectively the offset of the first character of a substring 313and the offset of the first character after the end of the substring. 314Offsets are measured from the beginning of the 315.Fa string 316argument given to 317.Fn regexec . 318An empty substring is denoted by equal offsets, 319both indicating the character following the empty substring. 320.Pp 321The 0th member of the 322.Fa pmatch 323array is filled in to indicate what substring of 324.Fa string 325was matched by the entire RE. 326Remaining members report what substring was matched by parenthesized 327subexpressions within the RE; 328member 329.Va i 330reports subexpression 331.Va i , 332with subexpressions counted (starting at 1) by the order of their opening 333parentheses in the RE, left to right. 334Unused entries in the array\(emcorresponding either to subexpressions that 335did not participate in the match at all, or to subexpressions that do not 336exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both 337.Fa rm_so 338and 339.Fa rm_eo 340set to \-1. 341If a subexpression participated in the match several times, 342the reported substring is the last one it matched. 343(Note, as an example in particular, that when the RE 344.Dq (b*)+ 345matches 346.Dq bbb , 347the parenthesized subexpression matches each of the three 348.Sq b Ns s 349and then 350an infinite number of empty strings following the last 351.Sq b , 352so the reported substring is one of the empties.) 353.Pp 354If 355.Dv REG_STARTEND 356is specified, 357.Fa pmatch 358must point to at least one 359.Li regmatch_t 360(even if 361.Fa nmatch 362is 0 or 363.Dv REG_NOSUB 364was specified), 365to hold the input offsets for 366.Dv REG_STARTEND . 367Use for output is still entirely controlled by 368.Fa nmatch ; 369if 370.Fa nmatch 371is 0 or 372.Dv REG_NOSUB 373was specified, 374the value of 375.Fa pmatch[0] 376will not be changed by a successful 377.Fn regexec . 378.Pp 379.Fn regerror 380maps a non-zero 381.Va errcode 382from either 383.Fn regcomp 384or 385.Fn regexec 386to a human-readable, printable message. 387If 388.Fa preg 389is non-NULL, 390the error code should have arisen from use of 391the 392.Li regex_t 393pointed to by 394.Fa preg , 395and if the error code came from 396.Fn regcomp , 397it should have been the result from the most recent 398.Fn regcomp 399using that 400.Li regex_t . ( 401.Fn regerror 402may be able to supply a more detailed message using information 403from the 404.Li regex_t . ) 405.Fn regerror 406places the null-terminated message into the buffer pointed to by 407.Fa errbuf , 408limiting the length (including the NUL) to at most 409.Fa errbuf_size 410bytes. 411If the whole message won't fit, 412as much of it as will fit before the terminating NUL is supplied. 413In any case, 414the returned value is the size of buffer needed to hold the whole 415message (including the terminating NUL). 416If 417.Fa errbuf_size 418is 0, 419.Fa errbuf 420is ignored but the return value is still correct. 421.Pp 422If the 423.Fa errcode 424given to 425.Fn regerror 426is first 427.Tn OR Ns 'ed 428with 429.Dv REG_ITOA , 430the 431.Dq message 432that results is the printable name of the error code, 433e.g., 434.Dq REG_NOMATCH , 435rather than an explanation thereof. 436If 437.Fa errcode 438is 439.Dv REG_ATOI , 440then 441.Fa preg 442shall be non-null and the 443.Fa re_endp 444member of the structure it points to 445must point to the printable name of an error code; 446in this case, the result in 447.Fa errbuf 448is the decimal digits of 449the numeric value of the error code 450(0 if the name is not recognized). 451.Dv REG_ITOA 452and 453.Dv REG_ATOI 454are intended primarily as debugging facilities; 455they are extensions, 456compatible with but not specified by POSIX 1003.2, 457and should be used with 458caution in software intended to be portable to other systems. 459Be warned also that they are considered experimental and changes are possible. 460.Pp 461.Fn regfree 462frees any dynamically allocated storage associated with the compiled RE 463pointed to by 464.Fa preg . 465The remaining 466.Li regex_t 467is no longer a valid compiled RE 468and the effect of supplying it to 469.Fn regexec 470or 471.Fn regerror 472is undefined. 473.Pp 474None of these functions references global variables except for tables 475of constants; 476all are safe for use from multiple threads if the arguments are safe. 477.Sh IMPLEMENTATION CHOICES 478There are a number of decisions that 1003.2 leaves up to the implementor, 479either by explicitly saying 480.Dq undefined 481or by virtue of them being 482forbidden by the RE grammar. 483This implementation treats them as follows. 484.Pp 485See 486.Xr re_format 7 487for a discussion of the definition of case-independent matching. 488.Pp 489There is no particular limit on the length of REs, 490except insofar as memory is limited. 491Memory usage is approximately linear in RE size, and largely insensitive 492to RE complexity, except for bounded repetitions. 493See 494.Sx BUGS 495for one short RE using them 496that will run almost any system out of memory. 497.Pp 498A backslashed character other than one specifically given a magic meaning 499by 1003.2 (such magic meanings occur only in obsolete REs) 500is taken as an ordinary character. 501.Pp 502Any unmatched 503.Ql \&[ 504is a 505.Dv REG_EBRACK 506error. 507.Pp 508Equivalence classes cannot begin or end bracket-expression ranges. 509The endpoint of one range cannot begin another. 510.Pp 511RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. 512.Pp 513A repetition operator (?, *, +, or bounds) cannot follow another 514repetition operator. 515A repetition operator cannot begin an expression or subexpression 516or follow 517.Ql ^ 518or 519.Ql | . 520.Pp 521A 522.Ql | 523cannot appear first or last in a (sub)expression, or after another 524.Ql | , 525i.e., an operand of 526.Ql | 527cannot be an empty subexpression. 528An empty parenthesized subexpression, 529.Ql \&(\&) , 530is legal and matches an 531empty (sub)string. 532An empty string is not a legal RE. 533.Pp 534A 535.Ql { 536followed by a digit is considered the beginning of bounds for a 537bounded repetition, which must then follow the syntax for bounds. 538A 539.Ql { 540.Em not 541followed by a digit is considered an ordinary character. 542.Pp 543.Ql ^ 544and 545.Ql $ 546beginning and ending subexpressions in obsolete 547.Pq Dq basic 548REs are anchors, not ordinary characters. 549.Sh SEE ALSO 550.Xr grep 1 , 551.Xr re_format 7 552.Pp 553POSIX 1003.2, sections 2.8 (Regular Expression Notation) 554and 555B.5 (C Binding for Regular Expression Matching). 556.Sh DIAGNOSTICS 557Non-zero error codes from 558.Fn regcomp 559and 560.Fn regexec 561include the following: 562.Pp 563.Bl -tag -compact -width XREG_ECOLLATEX 564.It Er REG_NOMATCH 565regexec() failed to match 566.It Er REG_BADPAT 567invalid regular expression 568.It Er REG_ECOLLATE 569invalid collating element 570.It Er REG_ECTYPE 571invalid character class 572.It Er REG_EESCAPE 573\e applied to unescapable character 574.It Er REG_ESUBREG 575invalid backreference number 576.It Er REG_EBRACK 577brackets [ ] not balanced 578.It Er REG_EPAREN 579parentheses ( ) not balanced 580.It Er REG_EBRACE 581braces { } not balanced 582.It Er REG_BADBR 583invalid repetition count(s) in { } 584.It Er REG_ERANGE 585invalid character range in [ ] 586.It Er REG_ESPACE 587ran out of memory 588.It Er REG_BADRPT 589?, *, or + operand invalid 590.It Er REG_EMPTY 591empty (sub)expression 592.It Er REG_ASSERT 593.Dq can't happen 594\(emyou found a bug 595.It Er REG_INVARG 596invalid argument, e.g. negative-length string 597.El 598.Sh HISTORY 599Originally written by Henry Spencer. 600Altered for inclusion in the 601.Bx 4.4 602distribution. 603.Sh BUGS 604This is an alpha release with known defects. 605Please report problems. 606.Pp 607There is one known functionality bug. 608The implementation of internationalization is incomplete: 609the locale is always assumed to be the default one of 1003.2, 610and only the collating elements etc. of that locale are available. 611.Pp 612The back-reference code is subtle and doubts linger about its correctness 613in complex cases. 614.Pp 615.Fn regexec 616performance is poor. 617This will improve with later releases. 618.Fa nmatch 619exceeding 0 is expensive; 620.Fa nmatch 621exceeding 1 is worse. 622.Fn regexec 623is largely insensitive to RE complexity 624.Em except 625that back references are massively expensive. 626RE length does matter; in particular, there is a strong speed bonus 627for keeping RE length under about 30 characters, 628with most special characters counting roughly double. 629.Pp 630.Fn regcomp 631implements bounded repetitions by macro expansion, 632which is costly in time and space if counts are large 633or bounded repetitions are nested. 634A RE like, say, 635.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100} 636will (eventually) run almost any existing machine out of swap space. 637.Pp 638There are suspected problems with response to obscure error conditions. 639Notably, 640certain kinds of internal overflow, 641produced only by truly enormous REs or by multiply nested bounded repetitions, 642are probably not handled well. 643.Pp 644Due to a mistake in 1003.2, things like 645.Ql a)b 646are legal REs because 647.Ql \&) 648is 649a special character only in the presence of a previous unmatched 650.Ql \&( . 651This can't be fixed until the spec is fixed. 652.Pp 653The standard's definition of back references is vague. 654For example, does 655.Dq a\e(\e(b\e)*\e2\e)*d 656match 657.Dq abbbd ? 658Until the standard is clarified, 659behavior in such cases should not be relied on. 660.Pp 661The implementation of word-boundary matching is a bit of a kludge, 662and bugs may lurk in combinations of word-boundary matching and anchoring. 663