1.\" Copyright (c) 1991, 1993 2.\" The Regents of the University of California. All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 3. All advertising materials mentioning features or use of this software 13.\" must display the following acknowledgement: 14.\" This product includes software developed by the University of 15.\" California, Berkeley and its contributors. 16.\" 4. Neither the name of the University nor the names of its contributors 17.\" may be used to endorse or promote products derived from this software 18.\" without specific prior written permission. 19.\" 20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" from: @(#)regexp.3 8.1 (Berkeley) 6/4/93 33.\" $NetBSD: regexp.3,v 1.6 1997/10/09 10:21:14 lukem Exp $ 34.\" 35.Dd June 4, 1993 36.Dt REGEXP 3 37.Os 38.Sh NAME 39.Nm regcomp , 40.Nm regexec , 41.Nm regsub , 42.Nm regerror 43.Nd regular expression handlers 44.Sh SYNOPSIS 45.Fd #include <regexp.h> 46.Ft regexp * 47.Fn regcomp "const char *exp" 48.Ft int 49.Fn regexec "const regexp *prog" "const char *string" 50.Ft void 51.Fn regsub "const regexp *prog" "const char *source" "char *dest" 52.Sh DESCRIPTION 53.Bf -symbolic 54This interface is made obsolete by 55.Xr regex 3 . 56It is available from the compatibility library, libcompat. 57.Ef 58.Pp 59The 60.Fn regcomp , 61.Fn regexec , 62.Fn regsub , 63and 64.Fn regerror 65functions 66implement 67.Xr egrep 1 Ns -style 68regular expressions and supporting facilities. 69.Pp 70The 71.Fn regcomp 72function 73compiles a regular expression into a structure of type 74.Xr regexp , 75and returns a pointer to it. 76The space has been allocated using 77.Xr malloc 3 78and may be released by 79.Xr free . 80.Pp 81The 82.Fn regexec 83function 84matches a 85.Dv NUL Ns -terminated 86.Fa string 87against the compiled regular expression 88in 89.Fa prog . 90It returns 1 for success and 0 for failure, and adjusts the contents of 91.Fa prog Ns 's 92.Em startp 93and 94.Em endp 95(see below) accordingly. 96.Pp 97The members of a 98.Xr regexp 99structure include at least the following (not necessarily in order): 100.Bd -literal -offset indent 101char *startp[NSUBEXP]; 102char *endp[NSUBEXP]; 103.Ed 104.Pp 105where 106.Dv NSUBEXP 107is defined (as 10) in the header file. 108Once a successful 109.Fn regexec 110has been done using the 111.Fn regexp , 112each 113.Em startp Ns - Em endp 114pair describes one substring 115within the 116.Fa string , 117with the 118.Em startp 119pointing to the first character of the substring and 120the 121.Em endp 122pointing to the first character following the substring. 123The 0th substring is the substring of 124.Fa string 125that matched the whole 126regular expression. 127The others are those substrings that matched parenthesized expressions 128within the regular expression, with parenthesized expressions numbered 129in left-to-right order of their opening parentheses. 130.Pp 131The 132.Fn regsub 133function 134copies 135.Fa source 136to 137.Fa dest , 138making substitutions according to the 139most recent 140.Fn regexec 141performed using 142.Fa prog . 143Each instance of `&' in 144.Fa source 145is replaced by the substring 146indicated by 147.Em startp Ns Bq 148and 149.Em endp Ns Bq . 150Each instance of 151.Sq \e Ns Em n , 152where 153.Em n 154is a digit, is replaced by 155the substring indicated by 156.Em startp Ns Bq Em n 157and 158.Em endp Ns Bq Em n . 159To get a literal `&' or 160.Sq \e Ns Em n 161into 162.Fa dest , 163prefix it with `\e'; 164to get a literal `\e' preceding `&' or 165.Sq \e Ns Em n , 166prefix it with 167another `\e'. 168.Pp 169The 170.Fn regerror 171function 172is called whenever an error is detected in 173.Fn regcomp , 174.Fn regexec , 175or 176.Fn regsub . 177The default 178.Fn regerror 179writes the string 180.Fa msg , 181with a suitable indicator of origin, 182on the standard 183error output 184and invokes 185.Xr exit 3 . 186The 187.Fn regerror 188function 189can be replaced by the user if other actions are desirable. 190.Sh REGULAR EXPRESSION SYNTAX 191A regular expression is zero or more 192.Em branches , 193separated by `|'. 194It matches anything that matches one of the branches. 195.Pp 196A branch is zero or more 197.Em pieces , 198concatenated. 199It matches a match for the first, followed by a match for the second, etc. 200.Pp 201A piece is an 202.Em atom 203possibly followed by `*', `+', or `?'. 204An atom followed by `*' matches a sequence of 0 or more matches of the atom. 205An atom followed by `+' matches a sequence of 1 or more matches of the atom. 206An atom followed by `?' matches a match of the atom, or the null string. 207.Pp 208An atom is a regular expression in parentheses (matching a match for the 209regular expression), a 210.Em range 211(see below), `.' 212(matching any single character), `^' (matching the null string at the 213beginning of the input string), `$' (matching the null string at the 214end of the input string), a `\e' followed by a single character (matching 215that character), or a single character with no other significance 216(matching that character). 217.Pp 218A 219.Em range 220is a sequence of characters enclosed in `[]'. 221It normally matches any single character from the sequence. 222If the sequence begins with `^', 223it matches any single character 224.Em not 225from the rest of the sequence. 226If two characters in the sequence are separated by `\-', this is shorthand 227for the full list of 228.Tn ASCII 229characters between them 230(e.g. `[0-9]' matches any decimal digit). 231To include a literal `]' in the sequence, make it the first character 232(following a possible `^'). 233To include a literal `\-', make it the first or last character. 234.Sh AMBIGUITY 235If a regular expression could match two different parts of the input string, 236it will match the one which begins earliest. 237If both begin in the same place but match different lengths, or match 238the same length in different ways, life gets messier, as follows. 239.Pp 240In general, the possibilities in a list of branches are considered in 241left-to-right order, the possibilities for `*', `+', and `?' are 242considered longest-first, nested constructs are considered from the 243outermost in, and concatenated constructs are considered leftmost-first. 244The match that will be chosen is the one that uses the earliest 245possibility in the first choice that has to be made. 246If there is more than one choice, the next will be made in the same manner 247(earliest possibility) subject to the decision on the first choice. 248And so forth. 249.Pp 250For example, 251.Sq Li (ab|a)b*c 252could match 253`abc' in one of two ways. 254The first choice is between `ab' and `a'; since `ab' is earlier, and does 255lead to a successful overall match, it is chosen. 256Since the `b' is already spoken for, 257the `b*' must match its last possibility\(emthe empty string\(emsince 258it must respect the earlier choice. 259.Pp 260In the particular case where no `|'s are present and there is only one 261`*', `+', or `?', the net effect is that the longest possible 262match will be chosen. 263So 264.Sq Li ab* , 265presented with `xabbbby', will match `abbbb'. 266Note that if 267.Sq Li ab* , 268is tried against `xabyabbbz', it 269will match `ab' just after `x', due to the begins-earliest rule. 270(In effect, the decision on where to start the match is the first choice 271to be made, hence subsequent choices must respect it even if this leads them 272to less-preferred alternatives.) 273.Sh RETURN VALUES 274The 275.Fn regcomp 276function 277returns 278.Dv NULL 279for a failure 280.Pf ( Fn regerror 281permitting), 282where failures are syntax errors, exceeding implementation limits, 283or applying `+' or `*' to a possibly-null operand. 284.Sh SEE ALSO 285.Xr ed 1 , 286.Xr ex 1 , 287.Xr expr 1 , 288.Xr egrep 1 , 289.Xr fgrep 1 , 290.Xr grep 1 , 291.Xr regex 3 292.Sh HISTORY 293Both code and manual page for 294.Fn regcomp , 295.Fn regexec , 296.Fn regsub , 297and 298.Fn regerror 299were written at the University of Toronto 300and appeared in 301.Bx 4.3 tahoe . 302They are intended to be compatible with the Bell V8 303.Xr regexp 3 , 304but are not derived from Bell code. 305.Sh BUGS 306Empty branches and empty regular expressions are not portable to V8. 307.Pp 308The restriction against 309applying `*' or `+' to a possibly-null operand is an artifact of the 310simplistic implementation. 311.Pp 312Does not support 313.Xr egrep Ns 's 314newline-separated branches; 315neither does the V8 316.Xr regexp 3 , 317though. 318.Pp 319Due to emphasis on 320compactness and simplicity, 321it's not strikingly fast. 322It does give special attention to handling simple cases quickly. 323