1.\" Copyright (c) 1991, 1993 2.\" The Regents of the University of California. All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 3. All advertising materials mentioning features or use of this software 13.\" must display the following acknowledgement: 14.\" This product includes software developed by the University of 15.\" California, Berkeley and its contributors. 16.\" 4. Neither the name of the University nor the names of its contributors 17.\" may be used to endorse or promote products derived from this software 18.\" without specific prior written permission. 19.\" 20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" from: @(#)regexp.3 8.1 (Berkeley) 6/4/93 33.\" $NetBSD: regexp.3,v 1.7 1998/02/05 18:51:31 perry Exp $ 34.\" 35.Dd June 4, 1993 36.Dt REGEXP 3 37.Os 38.Sh NAME 39.Nm regcomp , 40.Nm regexec , 41.Nm regsub , 42.Nm regerror 43.Nd regular expression handlers 44.Sh LIBRARY 45.Lb libcompat 46.Sh SYNOPSIS 47.Fd #include <regexp.h> 48.Ft regexp * 49.Fn regcomp "const char *exp" 50.Ft int 51.Fn regexec "const regexp *prog" "const char *string" 52.Ft void 53.Fn regsub "const regexp *prog" "const char *source" "char *dest" 54.Sh DESCRIPTION 55.Bf -symbolic 56This interface is made obsolete by 57.Xr regex 3 . 58It is available from the compatibility library, libcompat. 59.Ef 60.Pp 61The 62.Fn regcomp , 63.Fn regexec , 64.Fn regsub , 65and 66.Fn regerror 67functions 68implement 69.Xr egrep 1 Ns -style 70regular expressions and supporting facilities. 71.Pp 72The 73.Fn regcomp 74function 75compiles a regular expression into a structure of type 76.Xr regexp , 77and returns a pointer to it. 78The space has been allocated using 79.Xr malloc 3 80and may be released by 81.Xr free . 82.Pp 83The 84.Fn regexec 85function 86matches a 87.Dv NUL Ns -terminated 88.Fa string 89against the compiled regular expression 90in 91.Fa prog . 92It returns 1 for success and 0 for failure, and adjusts the contents of 93.Fa prog Ns 's 94.Em startp 95and 96.Em endp 97(see below) accordingly. 98.Pp 99The members of a 100.Xr regexp 101structure include at least the following (not necessarily in order): 102.Bd -literal -offset indent 103char *startp[NSUBEXP]; 104char *endp[NSUBEXP]; 105.Ed 106.Pp 107where 108.Dv NSUBEXP 109is defined (as 10) in the header file. 110Once a successful 111.Fn regexec 112has been done using the 113.Fn regexp , 114each 115.Em startp Ns - Em endp 116pair describes one substring 117within the 118.Fa string , 119with the 120.Em startp 121pointing to the first character of the substring and 122the 123.Em endp 124pointing to the first character following the substring. 125The 0th substring is the substring of 126.Fa string 127that matched the whole 128regular expression. 129The others are those substrings that matched parenthesized expressions 130within the regular expression, with parenthesized expressions numbered 131in left-to-right order of their opening parentheses. 132.Pp 133The 134.Fn regsub 135function 136copies 137.Fa source 138to 139.Fa dest , 140making substitutions according to the 141most recent 142.Fn regexec 143performed using 144.Fa prog . 145Each instance of `&' in 146.Fa source 147is replaced by the substring 148indicated by 149.Em startp Ns Bq 150and 151.Em endp Ns Bq . 152Each instance of 153.Sq \e Ns Em n , 154where 155.Em n 156is a digit, is replaced by 157the substring indicated by 158.Em startp Ns Bq Em n 159and 160.Em endp Ns Bq Em n . 161To get a literal `&' or 162.Sq \e Ns Em n 163into 164.Fa dest , 165prefix it with `\e'; 166to get a literal `\e' preceding `&' or 167.Sq \e Ns Em n , 168prefix it with 169another `\e'. 170.Pp 171The 172.Fn regerror 173function 174is called whenever an error is detected in 175.Fn regcomp , 176.Fn regexec , 177or 178.Fn regsub . 179The default 180.Fn regerror 181writes the string 182.Fa msg , 183with a suitable indicator of origin, 184on the standard 185error output 186and invokes 187.Xr exit 3 . 188The 189.Fn regerror 190function 191can be replaced by the user if other actions are desirable. 192.Sh REGULAR EXPRESSION SYNTAX 193A regular expression is zero or more 194.Em branches , 195separated by `|'. 196It matches anything that matches one of the branches. 197.Pp 198A branch is zero or more 199.Em pieces , 200concatenated. 201It matches a match for the first, followed by a match for the second, etc. 202.Pp 203A piece is an 204.Em atom 205possibly followed by `*', `+', or `?'. 206An atom followed by `*' matches a sequence of 0 or more matches of the atom. 207An atom followed by `+' matches a sequence of 1 or more matches of the atom. 208An atom followed by `?' matches a match of the atom, or the null string. 209.Pp 210An atom is a regular expression in parentheses (matching a match for the 211regular expression), a 212.Em range 213(see below), `.' 214(matching any single character), `^' (matching the null string at the 215beginning of the input string), `$' (matching the null string at the 216end of the input string), a `\e' followed by a single character (matching 217that character), or a single character with no other significance 218(matching that character). 219.Pp 220A 221.Em range 222is a sequence of characters enclosed in `[]'. 223It normally matches any single character from the sequence. 224If the sequence begins with `^', 225it matches any single character 226.Em not 227from the rest of the sequence. 228If two characters in the sequence are separated by `\-', this is shorthand 229for the full list of 230.Tn ASCII 231characters between them 232(e.g. `[0-9]' matches any decimal digit). 233To include a literal `]' in the sequence, make it the first character 234(following a possible `^'). 235To include a literal `\-', make it the first or last character. 236.Sh AMBIGUITY 237If a regular expression could match two different parts of the input string, 238it will match the one which begins earliest. 239If both begin in the same place but match different lengths, or match 240the same length in different ways, life gets messier, as follows. 241.Pp 242In general, the possibilities in a list of branches are considered in 243left-to-right order, the possibilities for `*', `+', and `?' are 244considered longest-first, nested constructs are considered from the 245outermost in, and concatenated constructs are considered leftmost-first. 246The match that will be chosen is the one that uses the earliest 247possibility in the first choice that has to be made. 248If there is more than one choice, the next will be made in the same manner 249(earliest possibility) subject to the decision on the first choice. 250And so forth. 251.Pp 252For example, 253.Sq Li (ab|a)b*c 254could match 255`abc' in one of two ways. 256The first choice is between `ab' and `a'; since `ab' is earlier, and does 257lead to a successful overall match, it is chosen. 258Since the `b' is already spoken for, 259the `b*' must match its last possibility\(emthe empty string\(emsince 260it must respect the earlier choice. 261.Pp 262In the particular case where no `|'s are present and there is only one 263`*', `+', or `?', the net effect is that the longest possible 264match will be chosen. 265So 266.Sq Li ab* , 267presented with `xabbbby', will match `abbbb'. 268Note that if 269.Sq Li ab* , 270is tried against `xabyabbbz', it 271will match `ab' just after `x', due to the begins-earliest rule. 272(In effect, the decision on where to start the match is the first choice 273to be made, hence subsequent choices must respect it even if this leads them 274to less-preferred alternatives.) 275.Sh RETURN VALUES 276The 277.Fn regcomp 278function 279returns 280.Dv NULL 281for a failure 282.Pf ( Fn regerror 283permitting), 284where failures are syntax errors, exceeding implementation limits, 285or applying `+' or `*' to a possibly-null operand. 286.Sh SEE ALSO 287.Xr ed 1 , 288.Xr ex 1 , 289.Xr expr 1 , 290.Xr egrep 1 , 291.Xr fgrep 1 , 292.Xr grep 1 , 293.Xr regex 3 294.Sh HISTORY 295Both code and manual page for 296.Fn regcomp , 297.Fn regexec , 298.Fn regsub , 299and 300.Fn regerror 301were written at the University of Toronto 302and appeared in 303.Bx 4.3 tahoe . 304They are intended to be compatible with the Bell V8 305.Xr regexp 3 , 306but are not derived from Bell code. 307.Sh BUGS 308Empty branches and empty regular expressions are not portable to V8. 309.Pp 310The restriction against 311applying `*' or `+' to a possibly-null operand is an artifact of the 312simplistic implementation. 313.Pp 314Does not support 315.Xr egrep Ns 's 316newline-separated branches; 317neither does the V8 318.Xr regexp 3 , 319though. 320.Pp 321Due to emphasis on 322compactness and simplicity, 323it's not strikingly fast. 324It does give special attention to handling simple cases quickly. 325