1.\" $NetBSD: regex.3,v 1.6 1998/02/05 07:28:03 perry Exp $ 2.\" 3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 4.\" Copyright (c) 1992, 1993, 1994 5.\" The Regents of the University of California. All rights reserved. 6.\" 7.\" This code is derived from software contributed to Berkeley by 8.\" Henry Spencer. 9.\" 10.\" Redistribution and use in source and binary forms, with or without 11.\" modification, are permitted provided that the following conditions 12.\" are met: 13.\" 1. Redistributions of source code must retain the above copyright 14.\" notice, this list of conditions and the following disclaimer. 15.\" 2. Redistributions in binary form must reproduce the above copyright 16.\" notice, this list of conditions and the following disclaimer in the 17.\" documentation and/or other materials provided with the distribution. 18.\" 3. All advertising materials mentioning features or use of this software 19.\" must display the following acknowledgement: 20.\" This product includes software developed by the University of 21.\" California, Berkeley and its contributors. 22.\" 4. Neither the name of the University nor the names of its contributors 23.\" may be used to endorse or promote products derived from this software 24.\" without specific prior written permission. 25.\" 26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 29.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 36.\" SUCH DAMAGE. 37.\" 38.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 39.\" 40.Dd March 20, 1994 41.Dt REGEX 3 42.Os 43.Sh NAME 44.Nm regex , 45.Nm regcomp , 46.Nm regexec , 47.Nm regerror , 48.Nm regfree 49.Nd regular-expression library 50.Sh SYNOPSIS 51.Fd #include <sys/types.h> 52.Fd #include <regex.h> 53.Ft int 54.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 55.Ft int 56.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 57.Ft size_t 58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size" 59.Ft void 60.Fn regfree "regex_t *preg" 61.Sh DESCRIPTION 62These routines implement 63.St -p1003.2-92 64regular expressions (``RE''s); 65see 66.Xr re_format 7 . 67.Fn regcomp 68compiles an RE written as a string into an internal form, 69.Fn regexec 70matches that internal form against a string and reports results, 71.Fn regerror 72transforms error codes from either into human-readable messages, 73and 74.Fn regfree 75frees any dynamically-allocated storage used by the internal form 76of an RE. 77.Pp 78The header 79.Em <regex.h> 80declares two structure types, 81.Fa regex_t 82and 83.Fa regmatch_t , 84the former for compiled internal forms and the latter for match reporting. 85It also declares the four functions, 86a type 87.Fa regoff_t , 88and a number of constants with names starting with ``REG_''. 89.Pp 90.Fn Regcomp 91compiles the regular expression contained in the 92.Fa pattern 93string, 94subject to the flags in 95.Fa cflags , 96and places the results in the 97.Fa regex_t 98structure pointed to by 99.Fa preg . 100.Fa Cflags 101is the bitwise OR of zero or more of the following flags: 102.Bl -tag -width XXXREG_EXTENDED 103.It Dv REG_EXTENDED 104Compile modern (``extended'') REs, rather than the obsolete 105(``basic'') REs that are the default. 106.It Dv REG_BASIC 107This is a synonym for 0, 108provided as a counterpart to REG_EXTENDED to improve readability. 109.It Dv REG_NOSPEC 110Compile with recognition of all special characters turned off. All 111characters are thus considered ordinary, so the ``RE'' is a literal 112string. 113This is an extension, compatible with but not specified by 114.St -p1003.2-92 , 115and should be used with caution in software intended to be portable to 116other systems. 117.Dv REG_EXTENDED 118and 119.Dv REG_NOSPEC 120may not be used in the same call to 121.Fn regcomp . 122.It Dv REG_ICASE 123Compile for matching that ignores upper/lower case distinctions. See 124.Xr re_format 7 . 125.It Dv REG_NOSUB 126Compile for matching that need only report success or failure, not 127what was matched. 128.It Dv REG_NEWLINE 129Compile for newline-sensitive matching. 130By default, newline is a completely ordinary character with no special 131meaning in either REs or strings. 132With this flag, 133`[^' bracket expressions and `.' never match newline, 134a `^' anchor matches the null string after any newline in the string 135in addition to its normal function, 136and the `$' anchor matches the null string before any newline in the 137string in addition to its normal function. 138.It Dv REG_PEND 139The regular expression ends, not at the first NUL, but just before the 140character pointed to by the 141.Fa re_endp 142member of the structure pointed to by 143.Fa preg . 144The 145.Fa re_endp 146member is of type 147.Fa "const\ char\ *" . 148This flag permits inclusion of NULs in the RE; they are considered 149ordinary characters. 150This is an extension, compatible with but not specified by 151.St -p1003.2-92 , 152and should be used with caution in software intended to be portable to 153other systems. 154.El 155.Pp 156When successful, 157.Fn regcomp 158returns 0 and fills in the structure pointed to by 159.Fa preg . 160One member of that structure (other than 161.Fa re_endp ) 162is publicized: 163.Fa re_nsub , 164of type 165.Fa size_t , 166contains the number of parenthesized subexpressions within the RE 167(except that the value of this member is undefined if the 168.Dv REG_NOSUB 169flag was used). 170If 171.Fn regcomp 172fails, it returns a non-zero error code; 173see DIAGNOSTICS. 174.Pp 175.Fn Regexec 176matches the compiled RE pointed to by 177.Fa preg 178against the 179.Fa string , 180subject to the flags in 181.Fa eflags , 182and reports results using 183.Fa nmatch , 184.Fa pmatch , 185and the returned value. 186The RE must have been compiled by a previous invocation of 187.Fn regcomp . 188The compiled form is not altered during execution of 189.Fn regexec , 190so a single compiled RE can be used simultaneously by multiple threads. 191.Pp 192By default, 193the NUL-terminated string pointed to by 194.Fa string 195is considered to be the text of an entire line, minus any terminating 196newline. 197The 198.Fa eflags 199argument is the bitwise OR of zero or more of the following flags: 200.Bl -tag -width XXXREG_NOTBOL 201.It Dv REG_NOTBOL 202The first character of the string 203is not the beginning of a line, so the `^' anchor should not match before it. 204This does not affect the behavior of newlines under 205.Dv REG_NEWLINE . 206.It Dv REG_NOTEOL 207The NUL terminating the string does not end a line, so the `$' anchor 208should not match before it. This does not affect the behavior of 209newlines under 210.Dv REG_NEWLINE . 211.It Dv REG_STARTEND 212The string is considered to start at 213.Fa string 214+ 215.Fa pmatch[0].rm_so 216and to have a terminating NUL located at 217.Fa string 218+ 219.Fa pmatch[0].rm_eo 220(there need not actually be a NUL at that location), 221regardless of the value of 222.Fa nmatch . 223See below for the definition of 224.Fa pmatch 225and 226.Fa nmatch . 227This is an extension, compatible with but not specified by 228.St -p1003.2-92 , 229and should be used with caution in software intended to be portable to 230other systems. 231Note that a non-zero 232.Fa rm_so 233does not imply 234.Dv REG_NOTBOL ; 235.Dv REG_STARTEND 236affects only the location of the string, not how it is matched. 237.El 238.Pp 239See 240.Xr re_format 7 241for a discussion of what is matched in situations where an RE or a 242portion thereof could match any of several substrings of 243.Fa string . 244.Pp 245Normally, 246.Fn regexec 247returns 0 for success and the non-zero code 248.Dv REG_NOMATCH 249for failure. 250Other non-zero error codes may be returned in exceptional situations; 251see DIAGNOSTICS. 252.Pp 253If 254.Dv REG_NOSUB 255was specified in the compilation of the RE, or if 256.Fa nmatch 257is 0, 258.Fn regexec 259ignores the 260.Fa pmatch 261argument (but see below for the case where 262.Dv REG_STARTEND 263is specified). 264Otherwise, 265.Fa pmatch 266points to an array of 267.Fa nmatch 268structures of type 269.Fa regmatch_t . 270Such a structure has at least the members 271.Fa rm_so 272and 273.Fa rm_eo , 274both of type 275.Fa regoff_t 276(a signed arithmetic type at least as large as an 277.Fa off_t 278and a 279.Fa ssize_t ) , 280containing respectively the offset of the first character of a substring 281and the offset of the first character after the end of the substring. 282Offsets are measured from the beginning of the 283.Fa string 284argument given to 285.Fn regexec . 286An empty substring is denoted by equal offsets, 287both indicating the character following the empty substring. 288.Pp 289The 0th member of the 290.Fa pmatch 291array is filled in to indicate what substring of 292.Fa string 293was matched by the entire RE. 294Remaining members report what substring was matched by parenthesized 295subexpressions within the RE; 296member 297.Fa i 298reports subexpression 299.Fa i , 300with subexpressions counted (starting at 1) by the order of their 301opening parentheses in the RE, left to right. 302Unused entries in the array\(emcorresponding either to subexpressions that 303did not participate in the match at all, or to subexpressions that do not 304exist in the RE (that is, 305.Fa i 306> 307.Fa preg->re_nsub ) 308\(emhave both 309.Fa rm_so 310and 311.Fa rm_eo 312set to -1. 313If a subexpression participated in the match several times, 314the reported substring is the last one it matched. 315(Note, as an example in particular, that when the RE `(b*)+' matches `bbb', 316the parenthesized subexpression matches each of the three `b's and then 317an infinite number of empty strings following the last `b', 318so the reported substring is one of the empties.) 319.Pp 320If 321.Dv REG_STARTEND 322is specified, 323.Fa pmatch 324must point to at least one 325.Fa regmatch_t 326(even if 327.Fa nmatch 328is 0 or 329.Dv REG_NOSUB 330was specified), 331to hold the input offsets for 332.Dv REG_STARTEND . 333Use for output is still entirely controlled by 334.Fa nmatch ; 335if 336.Fa nmatch 337is 0 or 338.Dv REG_NOSUB 339was specified, 340the value of 341.Fa pmatch [0] 342will not be changed by a successful 343.Fn regexec . 344.Pp 345.Fn Regerror 346maps a non-zero 347.Fa errcode 348from either 349.Fn regcomp 350or 351.Fn regexec 352to a human-readable, printable message. 353If 354.Fa preg 355is non-NULL, 356the error code should have arisen from use of the 357.Fa regex_t 358pointed to by 359.Fa preg , 360and if the error code came from 361.Fn regcomp , 362it should have been the result from the most recent 363.Fn regcomp 364using that 365.Fa regex_t . ( 366.Fn Regerror 367may be able to supply a more detailed message using information 368from the 369.Fa regex_t . ) 370.Fn Regerror 371places the NUL-terminated message into the buffer pointed to by 372.Fa errbuf , 373limiting the length (including the NUL) to at most 374.Fa errbuf_size 375bytes. 376If the whole message won't fit, 377as much of it as will fit before the terminating NUL is supplied. 378In any case, 379the returned value is the size of buffer needed to hold the whole 380message (including terminating NUL). 381If 382.Fa errbuf_size 383is 0, 384.Fa errbuf 385is ignored but the return value is still correct. 386.Pp 387If the 388.Fa errcode 389given to 390.Fn regerror 391is first ORed with 392.Dv REG_ITOA , 393the ``message'' that results is the printable name of the error code, 394e.g. ``REG_NOMATCH'', 395rather than an explanation thereof. 396If 397.Fa errcode 398.Dv REG_ATOI , 399then 400.Fa preg 401shall be non-NULL and the 402.Fa re_endp 403member of the structure it points to 404must point to the printable name of an error code; 405in this case, the result in 406.Fa errbuf 407is the decimal digits of 408the numeric value of the error code 409(0 if the name is not recognized). 410.Dv REG_ITOA 411and 412.Dv REG_ATOI 413are intended primarily as debugging facilities; 414they are extensions, compatible with but not specified by 415.St -p1003.2-92 , 416and should be used with caution in software intended to be portable to 417other systems. 418Be warned also that they are considered experimental and changes are possible. 419.Pp 420.Fn Regfree 421frees any dynamically-allocated storage associated with the compiled RE 422pointed to by 423.Fa preg . 424The remaining 425.Fa regex_t 426is no longer a valid compiled RE 427and the effect of supplying it to 428.Fn regexec 429or 430.Fn regerror 431is undefined. 432.Pp 433None of these functions references global variables except for tables 434of constants; 435all are safe for use from multiple threads if the arguments are safe. 436.Sh IMPLEMENTATION CHOICES 437There are a number of decisions that 438.St -p1003.2-92 439leaves up to the implementor, 440either by explicitly saying ``undefined'' or by virtue of them being 441forbidden by the RE grammar. 442This implementation treats them as follows. 443.Pp 444See 445.Xr re_format 7 446for a discussion of the definition of case-independent matching. 447.Pp 448There is no particular limit on the length of REs, 449except insofar as memory is limited. 450Memory usage is approximately linear in RE size, and largely insensitive 451to RE complexity, except for bounded repetitions. 452See BUGS for one short RE using them 453that will run almost any system out of memory. 454.Pp 455A backslashed character other than one specifically given a magic meaning 456by 457.St -p1003.2-92 458(such magic meanings occur only in obsolete [``basic''] REs) 459is taken as an ordinary character. 460.Pp 461Any unmatched [ is a 462.Dv REG_EBRACK 463error. 464.Pp 465Equivalence classes cannot begin or end bracket-expression ranges. 466The endpoint of one range cannot begin another. 467.Pp 468.Dv RE_DUP_MAX , 469the limit on repetition counts in bounded repetitions, is 255. 470.Pp 471A repetition operator (?, *, +, or bounds) cannot follow another 472repetition operator. 473A repetition operator cannot begin an expression or subexpression 474or follow `^' or `|'. 475.Pp 476`|' cannot appear first or last in a (sub)expression or after another `|', 477i.e. an operand of `|' cannot be an empty subexpression. 478An empty parenthesized subexpression, `()', is legal and matches an 479empty (sub)string. 480An empty string is not a legal RE. 481.Pp 482A `{' followed by a digit is considered the beginning of bounds for a 483bounded repetition, which must then follow the syntax for bounds. 484A `{' \fInot\fR followed by a digit is considered an ordinary character. 485.Pp 486`^' and `$' beginning and ending subexpressions in obsolete (``basic'') 487REs are anchors, not ordinary characters. 488.Sh SEE ALSO 489.Xr grep 1 , 490.Xr sed 1 , 491.Xr re_format 7 492.Pp 493.St -p1003.2-92 , 494sections 2.8 (Regular Expression Notation) 495and 496B.5 (C Binding for Regular Expression Matching). 497.Sh DIAGNOSTICS 498Non-zero error codes from 499.Fn regcomp 500and 501.Fn regexec 502include the following: 503.Pp 504.Bl -tag -width XXXREG_ECOLLATE -compact 505.It Dv REG_NOMATCH 506regexec() failed to match 507.It Dv REG_BADPAT 508invalid regular expression 509.It Dv REG_ECOLLATE 510invalid collating element 511.It Dv REG_ECTYPE 512invalid character class 513.It Dv REG_EESCAPE 514\e applied to unescapable character 515.It Dv REG_ESUBREG 516invalid backreference number 517.It Dv REG_EBRACK 518brackets [ ] not balanced 519.It Dv REG_EPAREN 520parentheses ( ) not balanced 521.It Dv REG_EBRACE 522braces { } not balanced 523.It Dv REG_BADBR 524invalid repetition count(s) in { } 525.It Dv REG_ERANGE 526invalid character range in [ ] 527.It Dv REG_ESPACE 528ran out of memory 529.It Dv REG_BADRPT 530?, *, or + operand invalid 531.It Dv REG_EMPTY 532empty (sub)expression 533.It Dv REG_ASSERT 534``can't happen''\(emyou found a bug 535.It Dv REG_INVARG 536invalid argument, e.g. negative-length string 537.El 538.Sh HISTORY 539Originally written by Henry Spencer. 540Altered for inclusion in the 4.4BSD distribution. 541.Sh BUGS 542This is an alpha release with known defects. 543Please report problems. 544.Pp 545There is one known functionality bug. 546The implementation of internationalization is incomplete: 547the locale is always assumed to be the default one of 548.St -p1003.2-92 , 549and only the collating elements etc. of that locale are available. 550.Pp 551The back-reference code is subtle and doubts linger about its correctness 552in complex cases. 553.Pp 554.Fn Regexec 555performance is poor. 556This will improve with later releases. 557.Fa Nmatch 558exceeding 0 is expensive; 559.Fa nmatch 560exceeding 1 is worse. 561.Fa Regexec 562is largely insensitive to RE complexity 563.Em except 564that back references are massively expensive. 565RE length does matter; in particular, there is a strong speed bonus 566for keeping RE length under about 30 characters, 567with most special characters counting roughly double. 568.Pp 569.Fn Regcomp 570implements bounded repetitions by macro expansion, 571which is costly in time and space if counts are large 572or bounded repetitions are nested. 573An RE like, say, 574`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' 575will (eventually) run almost any existing machine out of swap space. 576.Pp 577There are suspected problems with response to obscure error conditions. 578Notably, 579certain kinds of internal overflow, 580produced only by truly enormous REs or by multiply nested bounded repetitions, 581are probably not handled well. 582.Pp 583Due to a mistake in 584.St -p1003.2-92 , 585things like `a)b' are legal REs because `)' is a special character 586only in the presence of a previous unmatched `('. This can't be fixed 587until the spec is fixed. 588.Pp 589The standard's definition of back references is vague. 590For example, does 591`a\e(\e(b\e)*\e2\e)*d' match `abbbd'? 592Until the standard is clarified, behavior in such cases should not be 593relied on. 594.Pp 595The implementation of word-boundary matching is a bit of a kludge, 596and bugs may lurk in combinations of word-boundary matching and anchoring. 597