.\" $NetBSD: regex.3,v 1.35 2024/09/24 14:10:43 uwe Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: head/lib/libc/regex/regex.3 363817 2020-08-04 02:06:49Z kevans $
.\"
.Dd September 21, 2024
.Dt REGEX 3
.Os
.
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree ,
.Nm regasub ,
.Nm regnsub
.
.Nd regular-expression library
.
.Sh LIBRARY
.Lb libc
.
.Sh SYNOPSIS
.
.In regex.h
.
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg"
.Fa "const char * restrict pattern"
.Fa "int cflags"
.Fc
.
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg"
.Fa "const char * restrict string"
.Fa "size_t nmatch"
.Fa "regmatch_t pmatch[restrict]"
.Fa "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode"
.Fa "const regex_t * restrict preg"
.Fa "char * restrict errbuf"
.Fa "size_t errbuf_size"
.Fc
.
.Ft void
.Fn regfree "regex_t *preg"
.
.Ft ssize_t
.Fo regnsub
.Fa "char *buf"
.Fa "size_t bufsiz"
.Fa "const char *sub"
.Fa "const regmatch_t *rm"
.Fa "const char *str"
.Fc
.
.Ft ssize_t
.Fo regasub
.Fa "char **buf"
.Fa "const char *sub"
.Fa "const regmatch_t *rm"
.Fa "const char *sstr"
.Fc
.
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do Tn RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the functions listed above,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Ql REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise
.Em or
of zero or more of the following flags:
.Bl -tag -width Dv
.
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.
.It Dv REG_ICASE
Compile for matching that ignores upper\|/\^lower case distinctions.
See
.Xr re_format 7 .
.
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql \&[^
bracket expressions and
.Ql \&.
never match newline,
a
.Ql \&^
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql \&$
anchor matches the null string before any newline in the
string in addition to its normal function.
.
.It Dv REG_PEND
The regular expression ends,
not at the first
.Tn NUL ,
but just before the character pointed to by the
.Fa re_endp
member of the structure pointed to by
.Fa preg .
The
.Fa re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of
.Tn NUL Ns s
in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_GNU
Include GNU-inspired extensions:
.Pp
.Bl -tag -offset indent -width Ds -compact
.It Ic \e Ns Ar N
Use backreference
.Ar N
where
.Ar N
is a single-digit number between 1 and 9.
.It Ic \ea
Visual Bell
.It Ic \eb
Match a position that is a word boundary.
.It Ic \eB
Match a position that is not a word boundary.
.It Ic \ef
Form Feed
.It Ic \en
Line Feed
.It Ic \er
Carriage Return
.It Ic \es
Alias for
.Ql [[:space:]]
.It Ic \eS
Alias for
.Ql [^[:space:]]
.It Ic \et
Horizontal Tab
.It Ic \ev
Vertical Tab
.It Ic \ew
Alias for
.Ql [[:alnum:]_]
.It Ic \eW
Alias for
.Ql [^[:alnum:]_]
.It Ic \e'
Matches the end of the subject string (the string to be matched).
.It Ic \e`
Matches the beginning of the subject string.
.El
.Pp
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
.Pq other than Fa re_endp
is publicized:
.Fa re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
.Po
except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used
.Pc .
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx RETURN VALUES .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise
.Em or
of zero or more of the following flags:
.Bl -tag -width Dv
.
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql \&^ ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql \&$
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.
.It Dv REG_STARTEND
The string is considered to start at
.Sm off
.Fa string Li " + " Fa pmatch Li [0]. Fa rm_so
.Sm on
and to end before the byte located at
.Sm off
.Fa string Li " + " Fa pmatch Li [0]. Fa rm_eo ,
.Sm on
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql \&^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql \&^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx RETURN VALUES .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument
.Po
but see below for the case where
.Dv REG_STARTEND
is specified
.Pc .
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
.Po
a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t
.Pc ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The
.No 0 Ap th
member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted
.Pq starting at 1
by the order of their opening parentheses in the RE, left to right.
Unused entries in the array
.Po
corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE,
that is,
.Va i
>
.Fa preg Ns Li -> Ns Fa re_nsub
.Pc
have both
.Fa rm_so
and
.Fa rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
.Po
Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.
.Pc
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
.Po
even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified
.Pc ,
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns Li [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.Pf non- Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t
.Po
the
.Fn regerror
function may be able to supply a more detailed message using information
from the
.Ft regex_t
.Pc .
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length
.Pq including the Tn NUL
to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of the buffer needed to hold the whole
message
.Pq including terminating Tn NUL .
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first
.Em or Ap ed
with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.Pf non- Dv NULL
and the
.Fa re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
.Pq 0 if the name is not recognized .
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Pp
The
.Fn regnsub
and
.Fn regasub
functions perform substitutions using
.Xr sed 1 Ns -like
syntax.
They return the length of the string that would have been created
had there been enough space, or \-1 on error, setting
.Va errno .
The result
is placed in
.Fa buf ,
which is user-supplied in
.Fn regnsub
and dynamically allocated in
.Fn regasub .
The
.Fa sub
argument contains a substitution string which may refer to the first
9 parenthesized subexpression matches using
.So Ic \e Ns Ar N Sc
to refer to the
.Ar N Ns th
matched item, or
.Ql &
.Po
which is equivalent to
.Ic \e0
.Pc
to refer to the full match.
The
.Fa rm
array must be at least 10 elements long, and should contain the result
of the matches from a previous
.Fn regexec
call.
Only 10 elements of the
.Fa rm
array can be used.
The
.Fa str
argument contains the source string to apply the transformation to.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
.Po
such magic meanings occur only in obsolete
.Pq Dq basic
REs
.Pc
is taken as an ordinary character.
.Pp
Any unmatched
.Ql \&[
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Po
.Ql \&? ,
.Ql \&* ,
.Ql \&+ ,
or bounds
.Pc
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql \&^
or
.Ql \&| .
.Pp
.Ql \&|
cannot appear first or last in a (sub)expression or after another
.Ql \&| ,
i.e., an operand of
.Ql \&|
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql \&{
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql \&{
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql \&^
and
.Ql \&$
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh RETURN VALUES
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width ".Dv REG_ECOLLATE" -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql \&? ,
.Ql \&* ,
or
.Ql \&+
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8
.Pq Regular Expression Notation
and
B.5
.Pq Tn C No Binding for Regular Expression Matching .
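.Sh EXAMPLES
The following is a minimal sketch, not a definitive implementation,
of the usual sequence of
.Fn regcomp ,
.Fn regexec ,
.Fn regerror ,
and
.Fn regfree
calls described above.
The helper name
.Fn re_extract ,
its interface, and the fixed array size of 10 are assumptions made for
illustration; they are not part of the library.
Only the
.St -p1003.2
interfaces are used.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define NMATCH 10	/* whole match plus up to 9 subexpressions */

/*
 * Hypothetical helper, not part of the library.
 * Match the extended RE "pattern" against "string" and copy the text
 * of subexpression "idx" (0 means the whole match) into "out",
 * truncating it to fit.
 * Returns 0 on success.  On failure, returns REG_NOMATCH or another
 * regcomp()/regexec() error code and stores the regerror() message
 * in "out" instead.
 */
int
re_extract(const char *pattern, const char *string, size_t idx,
    char *out, size_t outsize)
{
	regex_t re;
	regmatch_t pm[NMATCH];
	size_t len;
	int rc;

	assert(out != NULL && outsize > 0 && idx < NMATCH);

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0) {
		/* Compilation failed, so there is no compiled RE to free. */
		regerror(rc, &re, out, outsize);
		return rc;
	}

	rc = regexec(&re, string, NMATCH, pm, 0);
	/* Entries for unused subexpressions have rm_so set to -1. */
	if (rc == 0 && pm[idx].rm_so == -1)
		rc = REG_NOMATCH;

	if (rc == 0) {
		/* rm_eo is the offset just past the matched substring. */
		len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
		if (len >= outsize)
			len = outsize - 1;
		memcpy(out, string + pm[idx].rm_so, len);
		out[len] = 0;
	} else {
		regerror(rc, &re, out, outsize);
	}

	regfree(&re);
	return rc;
}
```
.Ed
.Pp
For example, extracting subexpression 2 of
.Ql "([a-z]+)@([a-z.]+)"
from
.Ql "mail bob@example.org now"
yields
.Ql example.org .
Compiling and freeing on every call keeps the sketch short;
a caller that applies the same RE many times would normally compile it
once and reuse the
.Ft regex_t .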
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Pp
The
.Fn regnsub
and
.Fn regasub
functions appeared in
.Nx 8 .
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
An
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql \&)
is
a special character only in the presence of a previous unmatched
.Ql \&( .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.