.\" $NetBSD: regex.3,v 1.33 2022/12/04 01:29:32 uwe Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/regex.3 363817 2020-08-04 02:06:49Z kevans $
.\"
.Dd March 11, 2021
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree ,
.Nm regasub ,
.Nm regnsub
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Ft ssize_t
.Fn regnsub "char *buf" "size_t bufsiz" "const char *sub" "const regmatch_t *rm" "const char *str"
.Ft ssize_t
.Fn regasub "char **buf" "const char *sub" "const regmatch_t *rm" "const char *sstr"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the functions listed above,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
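.Pp
As a minimal sketch of the compile, match, and free life cycle outlined above,
the hypothetical helper
.Fn re_search
(an illustration only, not part of the library) reports whether an
extended RE matches anywhere in a string.
It assumes that only success or failure is wanted and therefore passes the
.Dv REG_EXTENDED
and
.Dv REG_NOSUB
flags described below.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the library.
 * Returns 1 if the extended RE `pattern` matches anywhere in `text`,
 * 0 if it does not, and -1 if the pattern does not compile or
 * regexec reports some other error.
 */
int
re_search(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is wanted, no offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;

	/* nmatch is 0, so the pmatch argument is ignored. */
	rc = regexec(&re, text, 0, NULL, 0);

	/* Release whatever regcomp allocated inside `re`. */
	regfree(&re);

	if (rc == 0)
		return 1;
	return rc == REG_NOMATCH ? 0 : -1;
}
```
.Ed
.Pp
A caller that matches many strings against one pattern would normally
compile once and call
.Fn regexec
repeatedly, rather than recompiling on every call as this sketch does.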
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_GNU
Include GNU-inspired extensions:
.Pp
.Bl -tag -offset indent -width XX -compact
.It \eN
Use backreference
.Dv N
where
.Dv N
is a single-digit number between
.Dv 1
and
.Dv 9 .
.It \ea
Visual Bell
.It \eb
Match a position that is a word boundary.
.It \eB
Match a position that is not a word boundary.
.It \ef
Form Feed
.It \en
Line Feed
.It \er
Carriage return
.It \es
Alias for [[:space:]]
.It \eS
Alias for [^[:space:]]
.It \et
Horizontal Tab
.It \ev
Vertical Tab
.It \ew
Alias for [[:alnum:]_]
.It \eW
Alias for [^[:alnum:]_]
.It \e'
Matches the end of the subject string (the string to be matched).
.It \e`
Matches the beginning of the subject string.
.El
.Pp
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx RETURN VALUES .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx RETURN VALUES .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
.Fn regerror
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of the buffer needed to hold the whole
message (including the terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
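.Pp
The reporting conventions described above for
.Fa pmatch
and
.Fn regerror
can be sketched together as follows.
The helper name
.Fn first_group
and its buffer parameters are hypothetical and chosen only for illustration;
the sketch assumes an extended RE and a caller-supplied output buffer of at
least one byte.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the library.
 * Copies the text matched by the first parenthesized subexpression of
 * `pattern` into `out` (outsz must be at least 1).  Returns 0 on
 * success; otherwise returns the regcomp or regexec error code and
 * stores the regerror message in `err`.
 */
int
first_group(const char *pattern, const char *text,
    char *out, size_t outsz, char *err, size_t errsz)
{
	regex_t re;
	regmatch_t pm[2];	/* pm[0]: whole match, pm[1]: group 1 */
	size_t len = 0;
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0) {
		/* Pass the regex_t that produced the error code. */
		regerror(rc, &re, err, errsz);
		return rc;
	}

	rc = regexec(&re, text, 2, pm, 0);
	if (rc != 0) {
		regerror(rc, &re, err, errsz);
		regfree(&re);
		return rc;
	}

	/* Both offsets are -1 if group 1 is absent or did not participate. */
	if (pm[1].rm_so != -1) {
		len = (size_t)(pm[1].rm_eo - pm[1].rm_so);
		if (len >= outsz)
			len = outsz - 1;	/* truncate to fit */
		memcpy(out, text + pm[1].rm_so, len);
	}
	out[len] = 0;

	regfree(&re);
	return 0;
}
```
.Ed
.Pp
A caller that cannot bound the message length in advance could instead call
.Fn regerror
with an
.Fa errbuf_size
of 0 to learn the required size, allocate that many bytes, and call it again.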
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Pp
The
.Fn regnsub
and
.Fn regasub
functions perform substitutions using a syntax similar to that of
.Xr sed 1 .
They return the length of the string that would have been created
had there been enough space, or
.Dv \-1
on error, setting
.Va errno .
The result is placed in
.Fa buf ,
which is user-supplied in
.Fn regnsub
and dynamically allocated in
.Fn regasub .
The
.Fa sub
argument contains a substitution string which may refer to the first
nine subexpression matches using
.Dq \e<n>
to refer to the nth matched
item, or
.Dq &
(which is equivalent to
.Dq \e0 )
to refer to the full match.
The
.Fa rm
array must be at least 10 elements long, and should contain the result
of the matches from a previous
.Fn regexec
call.
Only the first 10 elements of the
.Fa rm
array are used.
The
.Fa str
argument contains the source string to apply the transformation to.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
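.Pp
Because the choices above are specific to this implementation,
portable software may want to check how the local library treats such
patterns rather than assume a result.
The following is a minimal sketch of such a check; the function names
.Fn probe
and
.Fn probe_all
and the list of patterns are hypothetical, and no particular output is
implied, since other implementations may accept or reject these patterns
differently.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical probe, not part of the library.
 * Compiles `pattern` as an extended RE and prints either "accepted"
 * or the regerror message.  Returns the regcomp result.
 */
int
probe(const char *pattern)
{
	regex_t re;
	char msg[128];
	char line[256];
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
	if (rc == 0) {
		snprintf(line, sizeof(line), "'%s': accepted", pattern);
		regfree(&re);
	} else {
		regerror(rc, &re, msg, sizeof(msg));
		snprintf(line, sizeof(line), "'%s': %s", pattern, msg);
	}
	puts(line);
	return rc;
}

/* Try a few patterns drawn from the cases discussed above. */
void
probe_all(void)
{
	static const char *const cases[] = {
		"a**", "a||b", "()", "", "a{x", "a{2"
	};
	size_t i;

	for (i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
		(void)probe(cases[i]);
}
```
.Ed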
.Sh RETURN VALUES
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Pp
The
.Fn regnsub
and
.Fn regasub
functions appeared in
.Nx 8 .
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of
.Fn regexec
is poor.
This will improve with later releases.
An
.Fa nmatch
value greater than 0 is expensive;
an
.Fa nmatch
value greater than 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.