1.\" Copyright 1991 The Regents of the University of California. 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 3. All advertising materials mentioning features or use of this software 13.\" must display the following acknowledgement: 14.\" This product includes software developed by the University of 15.\" California, Berkeley and its contributors. 16.\" 4. Neither the name of the University nor the names of its contributors 17.\" may be used to endorse or promote products derived from this software 18.\" without specific prior written permission. 19.\" 20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" from: @(#)regexp.3 5.2 (Berkeley) 4/20/91 33.\" $Id: regexp.3,v 1.2 1993/08/01 07:57:39 mycroft Exp $ 34.\" 35.Dd April 20, 1991 36.Dt REGEXP 3 37.Os 38.Sh NAME 39.Nm regcomp , 40.Nm regexec , 41.Nm regsub , 42.Nm regerror 43.Nd regular expression handlers 44.Sh SYNOPSIS 45.Fd #include <regexp.h> 46.Ft regexp * 47.Fn regcomp "const char *exp" 48.Ft int 49.Fn regexec "const regexp *prog" "const char *string" 50.Ft void 51.Fn regsub "const regexp *prog" "const char *source" "char *dest" 52.Sh DESCRIPTION 53The 54.Fn regcomp , 55.Fn regexec , 56.Fn regsub , 57and 58.Fn regerror 59functions 60implement 61.Xr egrep 1 Ns -style 62regular expressions and supporting facilities. 63.Pp 64The 65.Fn regcomp 66function 67compiles a regular expression into a structure of type 68.Xr regexp , 69and returns a pointer to it. 70The space has been allocated using 71.Xr malloc 3 72and may be released by 73.Xr free . 74.Pp 75The 76.Fn regexec 77function 78matches a 79.Dv NUL Ns -terminated 80.Fa string 81against the compiled regular expression 82in 83.Fa prog . 84It returns 1 for success and 0 for failure, and adjusts the contents of 85.Fa prog Ns 's 86.Em startp 87and 88.Em endp 89(see below) accordingly. 90.Pp 91The members of a 92.Xr regexp 93structure include at least the following (not necessarily in order): 94.Bd -literal -offset indent 95char *startp[NSUBEXP]; 96char *endp[NSUBEXP]; 97.Ed 98.Pp 99where 100.Dv NSUBEXP 101is defined (as 10) in the header file. 102Once a successful 103.Fn regexec 104has been done using the 105.Fn regexp , 106each 107.Em startp Ns - Em endp 108pair describes one substring 109within the 110.Fa string , 111with the 112.Em startp 113pointing to the first character of the substring and 114the 115.Em endp 116pointing to the first character following the substring. 117The 0th substring is the substring of 118.Fa string 119that matched the whole 120regular expression. 121The others are those substrings that matched parenthesized expressions 122within the regular expression, with parenthesized expressions numbered 123in left-to-right order of their opening parentheses. 124.Pp 125The 126.Fn regsub 127function 128copies 129.Fa source 130to 131.Fa dest , 132making substitutions according to the 133most recent 134.Fn regexec 135performed using 136.Fa prog . 137Each instance of `&' in 138.Fa source 139is replaced by the substring 140indicated by 141.Em startp Ns Bq 142and 143.Em endp Ns Bq . 144Each instance of 145.Sq \e Ns Em n , 146where 147.Em n 148is a digit, is replaced by 149the substring indicated by 150.Em startp Ns Bq Em n 151and 152.Em endp Ns Bq Em n . 153To get a literal `&' or 154.Sq \e Ns Em n 155into 156.Fa dest , 157prefix it with `\e'; 158to get a literal `\e' preceding `&' or 159.Sq \e Ns Em n , 160prefix it with 161another `\e'. 162.Pp 163The 164.Fn regerror 165function 166is called whenever an error is detected in 167.Fn regcomp , 168.Fn regexec , 169or 170.Fn regsub . 171The default 172.Fn regerror 173writes the string 174.Fa msg , 175with a suitable indicator of origin, 176on the standard 177error output 178and invokes 179.Xr exit 2 . 180The 181.Fn regerror 182function 183can be replaced by the user if other actions are desirable. 184.Sh REGULAR EXPRESSION SYNTAX 185A regular expression is zero or more 186.Em branches , 187separated by `|'. 188It matches anything that matches one of the branches. 189.Pp 190A branch is zero or more 191.Em pieces , 192concatenated. 193It matches a match for the first, followed by a match for the second, etc. 194.Pp 195A piece is an 196.Em atom 197possibly followed by `*', `+', or `?'. 198An atom followed by `*' matches a sequence of 0 or more matches of the atom. 199An atom followed by `+' matches a sequence of 1 or more matches of the atom. 200An atom followed by `?' matches a match of the atom, or the null string. 201.Pp 202An atom is a regular expression in parentheses (matching a match for the 203regular expression), a 204.Em range 205(see below), `.' 206(matching any single character), `^' (matching the null string at the 207beginning of the input string), `$' (matching the null string at the 208end of the input string), a `\e' followed by a single character (matching 209that character), or a single character with no other significance 210(matching that character). 211.Pp 212A 213.Em range 214is a sequence of characters enclosed in `[]'. 215It normally matches any single character from the sequence. 216If the sequence begins with `^', 217it matches any single character 218.Em not 219from the rest of the sequence. 220If two characters in the sequence are separated by `\-', this is shorthand 221for the full list of 222.Tn ASCII 223characters between them 224(e.g. `[0-9]' matches any decimal digit). 225To include a literal `]' in the sequence, make it the first character 226(following a possible `^'). 227To include a literal `\-', make it the first or last character. 228.Sh AMBIGUITY 229If a regular expression could match two different parts of the input string, 230it will match the one which begins earliest. 231If both begin in the same place but match different lengths, or match 232the same length in different ways, life gets messier, as follows. 233.Pp 234In general, the possibilities in a list of branches are considered in 235left-to-right order, the possibilities for `*', `+', and `?' are 236considered longest-first, nested constructs are considered from the 237outermost in, and concatenated constructs are considered leftmost-first. 238The match that will be chosen is the one that uses the earliest 239possibility in the first choice that has to be made. 240If there is more than one choice, the next will be made in the same manner 241(earliest possibility) subject to the decision on the first choice. 242And so forth. 243.Pp 244For example, 245.Sq Li (ab|a)b*c 246could match 247`abc' in one of two ways. 248The first choice is between `ab' and `a'; since `ab' is earlier, and does 249lead to a successful overall match, it is chosen. 250Since the `b' is already spoken for, 251the `b*' must match its last possibility\(emthe empty string\(emsince 252it must respect the earlier choice. 253.Pp 254In the particular case where no `|'s are present and there is only one 255`*', `+', or `?', the net effect is that the longest possible 256match will be chosen. 257So 258.Sq Li ab* , 259presented with `xabbbby', will match `abbbb'. 260Note that if 261.Sq Li ab* , 262is tried against `xabyabbbz', it 263will match `ab' just after `x', due to the begins-earliest rule. 264(In effect, the decision on where to start the match is the first choice 265to be made, hence subsequent choices must respect it even if this leads them 266to less-preferred alternatives.) 267.Sh RETURN VALUES 268The 269.Fn regcomp 270function 271returns 272.Dv NULL 273for a failure 274.Pf ( Fn regerror 275permitting), 276where failures are syntax errors, exceeding implementation limits, 277or applying `+' or `*' to a possibly-null operand. 278.Sh SEE ALSO 279.Xr ed 1 , 280.Xr ex 1 , 281.Xr expr 1 , 282.Xr egrep 1 , 283.Xr fgrep 1 , 284.Xr grep 1 , 285.Xr regex 3 286.Sh HISTORY 287Both code and manual page for 288.Fn regcomp , 289.Fn regexec , 290.Fn regsub , 291and 292.Fn regerror 293were written at the University of Toronto 294and appeared in 295.Bx 4.3 tahoe . 296They are intended to be compatible with the Bell V8 297.Xr regexp 3 , 298but are not derived from Bell code. 299.Sh BUGS 300Empty branches and empty regular expressions are not portable to V8. 301.Pp 302The restriction against 303applying `*' or `+' to a possibly-null operand is an artifact of the 304simplistic implementation. 305.Pp 306Does not support 307.Xr egrep Ns 's 308newline-separated branches; 309neither does the V8 310.Xr regexp 3 , 311though. 312.Pp 313Due to emphasis on 314compactness and simplicity, 315it's not strikingly fast. 316It does give special attention to handling simple cases quickly. 317