1.\" Copyright 1991 The Regents of the University of California. 2.\" All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 3. All advertising materials mentioning features or use of this software 13.\" must display the following acknowledgement: 14.\" This product includes software developed by the University of 15.\" California, Berkeley and its contributors. 16.\" 4. Neither the name of the University nor the names of its contributors 17.\" may be used to endorse or promote products derived from this software 18.\" without specific prior written permission. 19.\" 20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" @(#)regexp.3 5.2 (Berkeley) 4/20/91 33.\" 34.Dd April 20, 1991 35.Dt REGEXP 3 36.Os 37.Sh NAME 38.Nm regcomp , 39.Nm regexec , 40.Nm regsub , 41.Nm regerror 42.Nd regular expression handlers 43.Sh SYNOPSIS 44.Fd #include <regexp.h> 45.Ft regexp * 46.Fn regcomp "const char *exp" 47.Ft int 48.Fn regexec "const regexp *prog" "const char *string" 49.Ft void 50.Fn regsub "const regexp *prog" "const char *source" "char *dest" 51.Sh DESCRIPTION 52The 53.Fn regcomp , 54.Fn regexec , 55.Fn regsub , 56and 57.Fn regerror 58functions 59implement 60.Xr egrep 1 Ns -style 61regular expressions and supporting facilities. 62.Pp 63The 64.Fn regcomp 65function 66compiles a regular expression into a structure of type 67.Xr regexp , 68and returns a pointer to it. 69The space has been allocated using 70.Xr malloc 3 71and may be released by 72.Xr free . 73.Pp 74The 75.Fn regexec 76function 77matches a 78.Dv NUL Ns -terminated 79.Fa string 80against the compiled regular expression 81in 82.Fa prog . 83It returns 1 for success and 0 for failure, and adjusts the contents of 84.Fa prog Ns 's 85.Em startp 86and 87.Em endp 88(see below) accordingly. 89.Pp 90The members of a 91.Xr regexp 92structure include at least the following (not necessarily in order): 93.Bd -literal -offset indent 94char *startp[NSUBEXP]; 95char *endp[NSUBEXP]; 96.Ed 97.Pp 98where 99.Dv NSUBEXP 100is defined (as 10) in the header file. 101Once a successful 102.Fn regexec 103has been done using the 104.Fn regexp , 105each 106.Em startp Ns - Em endp 107pair describes one substring 108within the 109.Fa string , 110with the 111.Em startp 112pointing to the first character of the substring and 113the 114.Em endp 115pointing to the first character following the substring. 116The 0th substring is the substring of 117.Fa string 118that matched the whole 119regular expression. 120The others are those substrings that matched parenthesized expressions 121within the regular expression, with parenthesized expressions numbered 122in left-to-right order of their opening parentheses. 123.Pp 124The 125.Fn regsub 126function 127copies 128.Fa source 129to 130.Fa dest , 131making substitutions according to the 132most recent 133.Fn regexec 134performed using 135.Fa prog . 136Each instance of `&' in 137.Fa source 138is replaced by the substring 139indicated by 140.Em startp Ns Bq 141and 142.Em endp Ns Bq . 143Each instance of 144.Sq \e Ns Em n , 145where 146.Em n 147is a digit, is replaced by 148the substring indicated by 149.Em startp Ns Bq Em n 150and 151.Em endp Ns Bq Em n . 152To get a literal `&' or 153.Sq \e Ns Em n 154into 155.Fa dest , 156prefix it with `\e'; 157to get a literal `\e' preceding `&' or 158.Sq \e Ns Em n , 159prefix it with 160another `\e'. 161.Pp 162The 163.Fn regerror 164function 165is called whenever an error is detected in 166.Fn regcomp , 167.Fn regexec , 168or 169.Fn regsub . 170The default 171.Fn regerror 172writes the string 173.Fa msg , 174with a suitable indicator of origin, 175on the standard 176error output 177and invokes 178.Xr exit 2 . 179The 180.Fn regerror 181function 182can be replaced by the user if other actions are desirable. 183.Sh REGULAR EXPRESSION SYNTAX 184A regular expression is zero or more 185.Em branches , 186separated by `|'. 187It matches anything that matches one of the branches. 188.Pp 189A branch is zero or more 190.Em pieces , 191concatenated. 192It matches a match for the first, followed by a match for the second, etc. 193.Pp 194A piece is an 195.Em atom 196possibly followed by `*', `+', or `?'. 197An atom followed by `*' matches a sequence of 0 or more matches of the atom. 198An atom followed by `+' matches a sequence of 1 or more matches of the atom. 199An atom followed by `?' matches a match of the atom, or the null string. 200.Pp 201An atom is a regular expression in parentheses (matching a match for the 202regular expression), a 203.Em range 204(see below), `.' 205(matching any single character), `^' (matching the null string at the 206beginning of the input string), `$' (matching the null string at the 207end of the input string), a `\e' followed by a single character (matching 208that character), or a single character with no other significance 209(matching that character). 210.Pp 211A 212.Em range 213is a sequence of characters enclosed in `[]'. 214It normally matches any single character from the sequence. 215If the sequence begins with `^', 216it matches any single character 217.Em not 218from the rest of the sequence. 219If two characters in the sequence are separated by `\-', this is shorthand 220for the full list of 221.Tn ASCII 222characters between them 223(e.g. `[0-9]' matches any decimal digit). 224To include a literal `]' in the sequence, make it the first character 225(following a possible `^'). 226To include a literal `\-', make it the first or last character. 227.Sh AMBIGUITY 228If a regular expression could match two different parts of the input string, 229it will match the one which begins earliest. 230If both begin in the same place but match different lengths, or match 231the same length in different ways, life gets messier, as follows. 232.Pp 233In general, the possibilities in a list of branches are considered in 234left-to-right order, the possibilities for `*', `+', and `?' are 235considered longest-first, nested constructs are considered from the 236outermost in, and concatenated constructs are considered leftmost-first. 237The match that will be chosen is the one that uses the earliest 238possibility in the first choice that has to be made. 239If there is more than one choice, the next will be made in the same manner 240(earliest possibility) subject to the decision on the first choice. 241And so forth. 242.Pp 243For example, 244.Sq Li (ab|a)b*c 245could match 246`abc' in one of two ways. 247The first choice is between `ab' and `a'; since `ab' is earlier, and does 248lead to a successful overall match, it is chosen. 249Since the `b' is already spoken for, 250the `b*' must match its last possibility\(emthe empty string\(emsince 251it must respect the earlier choice. 252.Pp 253In the particular case where no `|'s are present and there is only one 254`*', `+', or `?', the net effect is that the longest possible 255match will be chosen. 256So 257.Sq Li ab* , 258presented with `xabbbby', will match `abbbb'. 259Note that if 260.Sq Li ab* , 261is tried against `xabyabbbz', it 262will match `ab' just after `x', due to the begins-earliest rule. 263(In effect, the decision on where to start the match is the first choice 264to be made, hence subsequent choices must respect it even if this leads them 265to less-preferred alternatives.) 266.Sh RETURN VALUES 267The 268.Fn regcomp 269function 270returns 271.Dv NULL 272for a failure 273.Pf ( Fn regerror 274permitting), 275where failures are syntax errors, exceeding implementation limits, 276or applying `+' or `*' to a possibly-null operand. 277.Sh SEE ALSO 278.Xr ed 1 , 279.Xr ex 1 , 280.Xr expr 1 , 281.Xr egrep 1 , 282.Xr fgrep 1 , 283.Xr grep 1 , 284.Xr regex 3 285.Sh HISTORY 286Both code and manual page for 287.Fn regcomp , 288.Fn regexec , 289.Fn regsub , 290and 291.Fn regerror 292were written at the University of Toronto 293and appeared in 294.Bx 4.3 tahoe . 295They are intended to be compatible with the Bell V8 296.Xr regexp 3 , 297but are not derived from Bell code. 298.Sh BUGS 299Empty branches and empty regular expressions are not portable to V8. 300.Pp 301The restriction against 302applying `*' or `+' to a possibly-null operand is an artifact of the 303simplistic implementation. 304.Pp 305Does not support 306.Xr egrep Ns 's 307newline-separated branches; 308neither does the V8 309.Xr regexp 3 , 310though. 311.Pp 312Due to emphasis on 313compactness and simplicity, 314it's not strikingly fast. 315It does give special attention to handling simple cases quickly. 316