1.\" $OpenBSD: regex.3,v 1.16 2002/10/16 21:32:06 wcobb Exp $ 2.\" 3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. 4.\" 5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 6.\" Copyright (c) 1992, 1993, 1994 7.\" The Regents of the University of California. All rights reserved. 8.\" 9.\" This code is derived from software contributed to Berkeley by 10.\" Henry Spencer. 11.\" 12.\" Redistribution and use in source and binary forms, with or without 13.\" modification, are permitted provided that the following conditions 14.\" are met: 15.\" 1. Redistributions of source code must retain the above copyright 16.\" notice, this list of conditions and the following disclaimer. 17.\" 2. Redistributions in binary form must reproduce the above copyright 18.\" notice, this list of conditions and the following disclaimer in the 19.\" documentation and/or other materials provided with the distribution. 20.\" 3. All advertising materials mentioning features or use of this software 21.\" must display the following acknowledgement: 22.\" This product includes software developed by the University of 23.\" California, Berkeley and its contributors. 24.\" 4. Neither the name of the University nor the names of its contributors 25.\" may be used to endorse or promote products derived from this software 26.\" without specific prior written permission. 27.\" 28.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 29.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 30.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 31.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 32.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 33.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 34.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 35.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 36.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 37.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 38.\" SUCH DAMAGE. 39.\" 40.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 41.\" 42.Dd March 20, 1994 43.Dt REGEX 3 44.Os 45.Sh NAME 46.Nm regcomp , 47.Nm regexec , 48.Nm regerror , 49.Nm regfree 50.Nd regular-expression library 51.Sh SYNOPSIS 52.Fd #include <sys/types.h> 53.Fd #include <regex.h> 54.Ft int 55.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 56.Pp 57.Ft int 58.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \ 59 "regmatch_t pmatch[]" "int eflags" 60.Pp 61.Ft size_t 62.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \ 63 "size_t errbuf_size" 64.Pp 65.Ft void 66.Fn regfree "regex_t *preg" 67.Sh DESCRIPTION 68These routines implement 69.St -p1003.2 70regular expressions 71.Pq Dq REs ; 72see 73.Xr re_format 7 . 74.Fn regcomp 75compiles an RE written as a string into an internal form, 76.Fn regexec 77matches that internal form against a string and reports results, 78.Fn regerror 79transforms error codes from either into human-readable messages, and 80.Fn regfree 81frees any dynamically allocated storage used by the internal form 82of an RE. 83.Pp 84The header 85.Aq Pa regex.h 86declares two structure types, 87.Li regex_t 88and 89.Li regmatch_t , 90the former for compiled internal forms and the latter for match reporting. 91It also declares the four functions, 92a type 93.Li regoff_t , 94and a number of constants with names starting with 95.Dv REG_ . 96.Pp 97.Fn regcomp 98compiles the regular expression contained in the 99.Fa pattern 100string, 101subject to the flags in 102.Fa cflags , 103and places the results in the 104.Li regex_t 105structure pointed to by 106.Fa preg . 107.Fa cflags 108is the bitwise 109.Tn OR 110of zero or more of the following flags: 111.Bl -tag -width XREG_EXTENDEDX 112.It Dv REG_EXTENDED 113Compile modern 114.Pq Dq extended 115REs, 116rather than the obsolete 117.Pq Dq basic 118REs that are the default. 119.It Dv REG_BASIC 120This is a synonym for 0, 121provided as a counterpart to 122.Dv REG_EXTENDED 123to improve readability. 124.It Dv REG_NOSPEC 125Compile with recognition of all special characters turned off. 126All characters are thus considered ordinary, 127so the RE is a literal string. 128This is an extension, 129compatible with but not specified by 130.St -p1003.2 , 131and should be used with 132caution in software intended to be portable to other systems. 133.Dv REG_EXTENDED 134and 135.Dv REG_NOSPEC 136may not be used in the same call to 137.Fn regcomp . 138.It Dv REG_ICASE 139Compile for matching that ignores upper/lower case distinctions. 140See 141.Xr re_format 7 . 142.It Dv REG_NOSUB 143Compile for matching that need only report success or failure, 144not what was matched. 145.It Dv REG_NEWLINE 146Compile for newline-sensitive matching. 147By default, newline is a completely ordinary character with no special 148meaning in either REs or strings. 149With this flag, 150.Ql \&[^ 151bracket expressions and 152.Ql \&. 153never match newline, 154a 155.Ql ^ 156anchor matches the null string after any newline in the string 157in addition to its normal function, 158and the 159.Ql $ 160anchor matches the null string before any newline in the 161string in addition to its normal function. 162.It Dv REG_PEND 163The regular expression ends, 164not at the first NUL, 165but just before the character pointed to by the 166.Fa re_endp 167member of the structure pointed to by 168.Fa preg . 169The 170.Fa re_endp 171member is of type 172.Fa const\ char\ * . 173This flag permits inclusion of NULs in the RE; 174they are considered ordinary characters. 175This is an extension, 176compatible with but not specified by 177.St -p1003.2 , 178and should be used with 179caution in software intended to be portable to other systems. 180.El 181.Pp 182When successful, 183.Fn regcomp 184returns 0 and fills in the structure pointed to by 185.Fa preg . 186One member of that structure 187(other than 188.Fa re_endp ) 189is publicized: 190.Fa re_nsub , 191of type 192.Fa size_t , 193contains the number of parenthesized subexpressions within the RE 194(except that the value of this member is undefined if the 195.Dv REG_NOSUB 196flag was used). 197If 198.Fn regcomp 199fails, it returns a non-zero error code; 200see DIAGNOSTICS. 201.Pp 202.Fn regexec 203matches the compiled RE pointed to by 204.Fa preg 205against the 206.Fa string , 207subject to the flags in 208.Fa eflags , 209and reports results using 210.Fa nmatch , 211.Fa pmatch , 212and the returned value. 213The RE must have been compiled by a previous invocation of 214.Fn regcomp . 215The compiled form is not altered during execution of 216.Fn regexec , 217so a single compiled RE can be used simultaneously by multiple threads. 218.Pp 219By default, 220the null-terminated string pointed to by 221.Fa string 222is considered to be the text of an entire line, minus any terminating 223newline. 224The 225.Fa eflags 226argument is the bitwise 227.Tn OR 228of zero or more of the following flags: 229.Bl -tag -width XREG_STARTENDX 230.It Dv REG_NOTBOL 231The first character of 232the string 233is not the beginning of a line, so the 234.Ql ^ 235anchor should not match before it. 236This does not affect the behavior of newlines under 237.Dv REG_NEWLINE . 238.It Dv REG_NOTEOL 239The NUL terminating 240the string 241does not end a line, so the 242.Ql $ 243anchor should not match before it. 244This does not affect the behavior of newlines under 245.Dv REG_NEWLINE . 246.It Dv REG_STARTEND 247The string is considered to start at 248\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR 249and to have a terminating NUL located at 250\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR 251(there need not actually be a NUL at that location), 252regardless of the value of 253.Fa nmatch . 254See below for the definition of 255.Fa pmatch 256and 257.Fa nmatch . 258This is an extension, 259compatible with but not specified by 260.St -p1003.2 , 261and should be used with 262caution in software intended to be portable to other systems. 263Note that a non-zero \fIrm_so\fR does not imply 264.Dv REG_NOTBOL ; 265.Dv REG_STARTEND 266affects only the location of the string, 267not how it is matched. 268.El 269.Pp 270See 271.Xr re_format 7 272for a discussion of what is matched in situations where an RE or a 273portion thereof could match any of several substrings of 274.Fa string . 275.Pp 276Normally, 277.Fn regexec 278returns 0 for success and the non-zero code 279.Dv REG_NOMATCH 280for failure. 281Other non-zero error codes may be returned in exceptional situations; 282see DIAGNOSTICS. 283.Pp 284If 285.Dv REG_NOSUB 286was specified in the compilation of the RE, 287or if 288.Fa nmatch 289is 0, 290.Fn regexec 291ignores the 292.Fa pmatch 293argument (but see below for the case where 294.Dv REG_STARTEND 295is specified). 296Otherwise, 297.Fa pmatch 298points to an array of 299.Fa nmatch 300structures of type 301.Li regmatch_t . 302Such a structure has at least the members 303.Fa rm_so 304and 305.Fa rm_eo , 306both of type 307.Fa regoff_t 308(a signed arithmetic type at least as large as an 309.Li off_t 310and a 311.Li ssize_t ) , 312containing respectively the offset of the first character of a substring 313and the offset of the first character after the end of the substring. 314Offsets are measured from the beginning of the 315.Fa string 316argument given to 317.Fn regexec . 318An empty substring is denoted by equal offsets, 319both indicating the character following the empty substring. 320.Pp 321The 0th member of the 322.Fa pmatch 323array is filled in to indicate what substring of 324.Fa string 325was matched by the entire RE. 326Remaining members report what substring was matched by parenthesized 327subexpressions within the RE; 328member 329.Va i 330reports subexpression 331.Va i , 332with subexpressions counted (starting at 1) by the order of their opening 333parentheses in the RE, left to right. 334Unused entries in the array\(emcorresponding either to subexpressions that 335did not participate in the match at all, or to subexpressions that do not 336exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both 337.Fa rm_so 338and 339.Fa rm_eo 340set to \-1. 341If a subexpression participated in the match several times, 342the reported substring is the last one it matched. 343(Note, as an example in particular, that when the RE 344.Dq (b*)+ 345matches 346.Dq bbb , 347the parenthesized subexpression matches each of the three 348.Sq b Ns s 349and then 350an infinite number of empty strings following the last 351.Sq b , 352so the reported substring is one of the empties.) 353.Pp 354If 355.Dv REG_STARTEND 356is specified, 357.Fa pmatch 358must point to at least one 359.Li regmatch_t 360(even if 361.Fa nmatch 362is 0 or 363.Dv REG_NOSUB 364was specified), 365to hold the input offsets for 366.Dv REG_STARTEND . 367Use for output is still entirely controlled by 368.Fa nmatch ; 369if 370.Fa nmatch 371is 0 or 372.Dv REG_NOSUB 373was specified, 374the value of 375.Fa pmatch[0] 376will not be changed by a successful 377.Fn regexec . 378.Pp 379.Fn regerror 380maps a non-zero 381.Va errcode 382from either 383.Fn regcomp 384or 385.Fn regexec 386to a human-readable, printable message. 387If 388.Fa preg 389is non-NULL, 390the error code should have arisen from use of 391the 392.Li regex_t 393pointed to by 394.Fa preg , 395and if the error code came from 396.Fn regcomp , 397it should have been the result from the most recent 398.Fn regcomp 399using that 400.Li regex_t . 401.Pf ( Fn regerror 402may be able to supply a more detailed message using information 403from the 404.Li regex_t . ) 405.Fn regerror 406places the null-terminated message into the buffer pointed to by 407.Fa errbuf , 408limiting the length (including the NUL) to at most 409.Fa errbuf_size 410bytes. 411If the whole message won't fit, 412as much of it as will fit before the terminating NUL is supplied. 413In any case, 414the returned value is the size of buffer needed to hold the whole 415message (including the terminating NUL). 416If 417.Fa errbuf_size 418is 0, 419.Fa errbuf 420is ignored but the return value is still correct. 421.Pp 422If the 423.Fa errcode 424given to 425.Fn regerror 426is first 427.Tn OR Ns 'ed 428with 429.Dv REG_ITOA , 430the 431.Dq message 432that results is the printable name of the error code, 433e.g., 434.Dq REG_NOMATCH , 435rather than an explanation thereof. 436If 437.Fa errcode 438is 439.Dv REG_ATOI , 440then 441.Fa preg 442shall be non-null and the 443.Fa re_endp 444member of the structure it points to 445must point to the printable name of an error code; 446in this case, the result in 447.Fa errbuf 448is the decimal digits of 449the numeric value of the error code 450(0 if the name is not recognized). 451.Dv REG_ITOA 452and 453.Dv REG_ATOI 454are intended primarily as debugging facilities; 455they are extensions, 456compatible with but not specified by 457.St -p1003.2 458and should be used with 459caution in software intended to be portable to other systems. 460Be warned also that they are considered experimental and changes are possible. 461.Pp 462.Fn regfree 463frees any dynamically allocated storage associated with the compiled RE 464pointed to by 465.Fa preg . 466The remaining 467.Li regex_t 468is no longer a valid compiled RE 469and the effect of supplying it to 470.Fn regexec 471or 472.Fn regerror 473is undefined. 474.Pp 475None of these functions references global variables except for tables 476of constants; 477all are safe for use from multiple threads if the arguments are safe. 478.Sh IMPLEMENTATION CHOICES 479There are a number of decisions that 480.St -p1003.2 481leaves up to the implementor, 482either by explicitly saying 483.Dq undefined 484or by virtue of them being 485forbidden by the RE grammar. 486This implementation treats them as follows. 487.Pp 488See 489.Xr re_format 7 490for a discussion of the definition of case-independent matching. 491.Pp 492There is no particular limit on the length of REs, 493except insofar as memory is limited. 494Memory usage is approximately linear in RE size, and largely insensitive 495to RE complexity, except for bounded repetitions. 496See 497.Sx BUGS 498for one short RE using them 499that will run almost any system out of memory. 500.Pp 501A backslashed character other than one specifically given a magic meaning 502by 503.St -p1003.2 504(such magic meanings occur only in obsolete REs) 505is taken as an ordinary character. 506.Pp 507Any unmatched 508.Ql \&[ 509is a 510.Dv REG_EBRACK 511error. 512.Pp 513Equivalence classes cannot begin or end bracket-expression ranges. 514The endpoint of one range cannot begin another. 515.Pp 516RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. 517.Pp 518A repetition operator (?, *, +, or bounds) cannot follow another 519repetition operator. 520A repetition operator cannot begin an expression or subexpression 521or follow 522.Ql ^ 523or 524.Ql | . 525.Pp 526A 527.Ql | 528cannot appear first or last in a (sub)expression, or after another 529.Ql | , 530i.e., an operand of 531.Ql | 532cannot be an empty subexpression. 533An empty parenthesized subexpression, 534.Ql \&(\&) , 535is legal and matches an 536empty (sub)string. 537An empty string is not a legal RE. 538.Pp 539A 540.Ql { 541followed by a digit is considered the beginning of bounds for a 542bounded repetition, which must then follow the syntax for bounds. 543A 544.Ql { 545.Em not 546followed by a digit is considered an ordinary character. 547.Pp 548.Ql ^ 549and 550.Ql $ 551beginning and ending subexpressions in obsolete 552.Pq Dq basic 553REs are anchors, not ordinary characters. 554.Sh SEE ALSO 555.Xr grep 1 , 556.Xr re_format 7 557.Pp 558.St -p1003.2 , 559sections 2.8 (Regular Expression Notation) 560and 561B.5 (C Binding for Regular Expression Matching). 562.Sh DIAGNOSTICS 563Non-zero error codes from 564.Fn regcomp 565and 566.Fn regexec 567include the following: 568.Pp 569.Bl -tag -compact -width XREG_ECOLLATEX 570.It Er REG_NOMATCH 571regexec() failed to match 572.It Er REG_BADPAT 573invalid regular expression 574.It Er REG_ECOLLATE 575invalid collating element 576.It Er REG_ECTYPE 577invalid character class 578.It Er REG_EESCAPE 579\e applied to unescapable character 580.It Er REG_ESUBREG 581invalid backreference number 582.It Er REG_EBRACK 583brackets [ ] not balanced 584.It Er REG_EPAREN 585parentheses ( ) not balanced 586.It Er REG_EBRACE 587braces { } not balanced 588.It Er REG_BADBR 589invalid repetition count(s) in { } 590.It Er REG_ERANGE 591invalid character range in [ ] 592.It Er REG_ESPACE 593ran out of memory 594.It Er REG_BADRPT 595?, *, or + operand invalid 596.It Er REG_EMPTY 597empty (sub)expression 598.It Er REG_ASSERT 599.Dq can't happen 600\(emyou found a bug 601.It Er REG_INVARG 602invalid argument, e.g., negative-length string 603.El 604.Sh HISTORY 605Originally written by Henry Spencer. 606Altered for inclusion in the 607.Bx 4.4 608distribution. 609.Sh BUGS 610This is an alpha release with known defects. 611Please report problems. 612.Pp 613There is one known functionality bug. 614The implementation of internationalization is incomplete: 615the locale is always assumed to be the default one of 616.St -p1003.2 , 617and only the collating elements etc. of that locale are available. 618.Pp 619The back-reference code is subtle and doubts linger about its correctness 620in complex cases. 621.Pp 622.Fn regexec 623performance is poor. 624This will improve with later releases. 625.Fa nmatch 626exceeding 0 is expensive; 627.Fa nmatch 628exceeding 1 is worse. 629.Fn regexec 630is largely insensitive to RE complexity 631.Em except 632that back references are massively expensive. 633RE length does matter; in particular, there is a strong speed bonus 634for keeping RE length under about 30 characters, 635with most special characters counting roughly double. 636.Pp 637.Fn regcomp 638implements bounded repetitions by macro expansion, 639which is costly in time and space if counts are large 640or bounded repetitions are nested. 641A RE like, say, 642.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100} 643will (eventually) run almost any existing machine out of swap space. 644.Pp 645There are suspected problems with response to obscure error conditions. 646Notably, 647certain kinds of internal overflow, 648produced only by truly enormous REs or by multiply nested bounded repetitions, 649are probably not handled well. 650.Pp 651Due to a mistake in 652.St -p1003.2 , 653things like 654.Ql a)b 655are legal REs because 656.Ql \&) 657is 658a special character only in the presence of a previous unmatched 659.Ql \&( . 660This can't be fixed until the spec is fixed. 661.Pp 662The standard's definition of back references is vague. 663For example, does 664.Dq a\e(\e(b\e)*\e2\e)*d 665match 666.Dq abbbd ? 667Until the standard is clarified, 668behavior in such cases should not be relied on. 669.Pp 670The implementation of word-boundary matching is a bit of a kludge, 671and bugs may lurk in combinations of word-boundary matching and anchoring. 672