1.\" Copyright (c) 1991, 1993 2.\" The Regents of the University of California. All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 3. All advertising materials mentioning features or use of this software 13.\" must display the following acknowledgement: 14.\" This product includes software developed by the University of 15.\" California, Berkeley and its contributors. 16.\" 4. Neither the name of the University nor the names of its contributors 17.\" may be used to endorse or promote products derived from this software 18.\" without specific prior written permission. 19.\" 20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" from: @(#)regexp.3 8.1 (Berkeley) 6/4/93 33.\" $NetBSD: regexp.3,v 1.9 1998/04/29 19:25:25 fair Exp $ 34.\" 35.Dd June 4, 1993 36.Dt REGEXP 3 37.Os 38.Sh NAME 39.Nm regcomp , 40.Nm regexec , 41.Nm regsub , 42.Nm regerror 43.Nd regular expression handlers 44.Sh LIBRARY 45.Lb libcompat 46.Sh SYNOPSIS 47.Fd #include <regexp.h> 48.Ft regexp * 49.Fn regcomp "const char *exp" 50.Ft int 51.Fn regexec "const regexp *prog" "const char *string" 52.Ft void 53.Fn regsub "const regexp *prog" "const char *source" "char *dest" 54.Sh DESCRIPTION 55.Bf -symbolic 56This interface is made obsolete by 57.Xr regex 3 . 58It is available from the compatibility library, libcompat. 59.Ef 60.Pp 61The 62.Fn regcomp , 63.Fn regexec , 64.Fn regsub , 65and 66.Fn regerror 67functions implement 68.Xr egrep 1 Ns -style 69regular expressions and supporting facilities. 70.Pp 71The 72.Fn regcomp 73function 74compiles a regular expression into a structure of type 75.Em regexp , 76and returns a pointer to it. 77The space has been allocated using 78.Xr malloc 3 79and may be released by 80.Xr free 3 . 81.Pp 82The 83.Fn regexec 84function 85matches a 86.Dv NUL Ns -terminated 87.Fa string 88against the compiled regular expression 89in 90.Fa prog . 91It returns 1 for success and 0 for failure, and adjusts the contents of 92.Fa prog Ns 's 93.Em startp 94and 95.Em endp 96(see below) accordingly. 97.Pp 98The members of a 99.Em regexp 100structure include at least the following (not necessarily in order): 101.Bd -literal -offset indent 102char *startp[NSUBEXP]; 103char *endp[NSUBEXP]; 104.Ed 105.Pp 106where 107.Dv NSUBEXP 108is defined (as 10) in the header file. 109Once a successful 110.Fn regexec 111has been done using the 112.Fn regexp , 113each 114.Em startp Ns - Em endp 115pair describes one substring 116within the 117.Fa string , 118with the 119.Em startp 120pointing to the first character of the substring and 121the 122.Em endp 123pointing to the first character following the substring. 124The 0th substring is the substring of 125.Fa string 126that matched the whole 127regular expression. 128The others are those substrings that matched parenthesized expressions 129within the regular expression, with parenthesized expressions numbered 130in left-to-right order of their opening parentheses. 131.Pp 132The 133.Fn regsub 134function 135copies 136.Fa source 137to 138.Fa dest , 139making substitutions according to the 140most recent 141.Fn regexec 142performed using 143.Fa prog . 144Each instance of `&' in 145.Fa source 146is replaced by the substring 147indicated by 148.Em startp Ns Bq 149and 150.Em endp Ns Bq . 151Each instance of 152.Sq \e Ns Em n , 153where 154.Em n 155is a digit, is replaced by 156the substring indicated by 157.Em startp Ns Bq Em n 158and 159.Em endp Ns Bq Em n . 160To get a literal `&' or 161.Sq \e Ns Em n 162into 163.Fa dest , 164prefix it with `\e'; 165to get a literal `\e' preceding `&' or 166.Sq \e Ns Em n , 167prefix it with 168another `\e'. 169.Pp 170The 171.Fn regerror 172function 173is called whenever an error is detected in 174.Fn regcomp , 175.Fn regexec , 176or 177.Fn regsub . 178The default 179.Fn regerror 180writes the string 181.Fa msg , 182with a suitable indicator of origin, 183on the standard 184error output 185and invokes 186.Xr exit 3 . 187The 188.Fn regerror 189function 190can be replaced by the user if other actions are desirable. 191.Sh REGULAR EXPRESSION SYNTAX 192A regular expression is zero or more 193.Em branches , 194separated by `|'. 195It matches anything that matches one of the branches. 196.Pp 197A branch is zero or more 198.Em pieces , 199concatenated. 200It matches a match for the first, followed by a match for the second, etc. 201.Pp 202A piece is an 203.Em atom 204possibly followed by `*', `+', or `?'. 205An atom followed by `*' matches a sequence of 0 or more matches of the atom. 206An atom followed by `+' matches a sequence of 1 or more matches of the atom. 207An atom followed by `?' matches a match of the atom, or the null string. 208.Pp 209An atom is a regular expression in parentheses (matching a match for the 210regular expression), a 211.Em range 212(see below), `.' 213(matching any single character), `^' (matching the null string at the 214beginning of the input string), `$' (matching the null string at the 215end of the input string), a `\e' followed by a single character (matching 216that character), or a single character with no other significance 217(matching that character). 218.Pp 219A 220.Em range 221is a sequence of characters enclosed in `[]'. 222It normally matches any single character from the sequence. 223If the sequence begins with `^', 224it matches any single character 225.Em not 226from the rest of the sequence. 227If two characters in the sequence are separated by `\-', this is shorthand 228for the full list of 229.Tn ASCII 230characters between them 231(e.g. `[0-9]' matches any decimal digit). 232To include a literal `]' in the sequence, make it the first character 233(following a possible `^'). 234To include a literal `\-', make it the first or last character. 235.Sh AMBIGUITY 236If a regular expression could match two different parts of the input string, 237it will match the one which begins earliest. 238If both begin in the same place but match different lengths, or match 239the same length in different ways, life gets messier, as follows. 240.Pp 241In general, the possibilities in a list of branches are considered in 242left-to-right order, the possibilities for `*', `+', and `?' are 243considered longest-first, nested constructs are considered from the 244outermost in, and concatenated constructs are considered leftmost-first. 245The match that will be chosen is the one that uses the earliest 246possibility in the first choice that has to be made. 247If there is more than one choice, the next will be made in the same manner 248(earliest possibility) subject to the decision on the first choice. 249And so forth. 250.Pp 251For example, 252.Sq Li (ab|a)b*c 253could match 254`abc' in one of two ways. 255The first choice is between `ab' and `a'; since `ab' is earlier, and does 256lead to a successful overall match, it is chosen. 257Since the `b' is already spoken for, 258the `b*' must match its last possibility\(emthe empty string\(emsince 259it must respect the earlier choice. 260.Pp 261In the particular case where no `|'s are present and there is only one 262`*', `+', or `?', the net effect is that the longest possible 263match will be chosen. 264So 265.Sq Li ab* , 266presented with `xabbbby', will match `abbbb'. 267Note that if 268.Sq Li ab* , 269is tried against `xabyabbbz', it 270will match `ab' just after `x', due to the begins-earliest rule. 271(In effect, the decision on where to start the match is the first choice 272to be made, hence subsequent choices must respect it even if this leads them 273to less-preferred alternatives.) 274.Sh RETURN VALUES 275The 276.Fn regcomp 277function 278returns 279.Dv NULL 280for a failure 281.Pf ( Fn regerror 282permitting), 283where failures are syntax errors, exceeding implementation limits, 284or applying `+' or `*' to a possibly-null operand. 285.Sh SEE ALSO 286.Xr ed 1 , 287.Xr ex 1 , 288.Xr expr 1 , 289.Xr egrep 1 , 290.Xr fgrep 1 , 291.Xr grep 1 , 292.Xr regex 3 293.Sh HISTORY 294Both code and manual page for 295.Fn regcomp , 296.Fn regexec , 297.Fn regsub , 298and 299.Fn regerror 300were written at the University of Toronto 301and appeared in 302.Bx 4.3 tahoe . 303They are intended to be compatible with the Bell V8 304.Xr regexp 3 , 305but are not derived from Bell code. 306.Sh BUGS 307Empty branches and empty regular expressions are not portable to V8. 308.Pp 309The restriction against 310applying `*' or `+' to a possibly-null operand is an artifact of the 311simplistic implementation. 312.Pp 313Does not support 314.Xr egrep 1 Ns 's 315newline-separated branches; 316neither does the V8 317.Xr regexp 3 , 318though. 319.Pp 320Due to emphasis on 321compactness and simplicity, 322it's not strikingly fast. 323It does give special attention to handling simple cases quickly. 324