1.\" Copyright (c) 1991, 1993 2.\" The Regents of the University of California. All rights reserved. 3.\" 4.\" Redistribution and use in source and binary forms, with or without 5.\" modification, are permitted provided that the following conditions 6.\" are met: 7.\" 1. Redistributions of source code must retain the above copyright 8.\" notice, this list of conditions and the following disclaimer. 9.\" 2. Redistributions in binary form must reproduce the above copyright 10.\" notice, this list of conditions and the following disclaimer in the 11.\" documentation and/or other materials provided with the distribution. 12.\" 3. All advertising materials mentioning features or use of this software 13.\" must display the following acknowledgement: 14.\" This product includes software developed by the University of 15.\" California, Berkeley and its contributors. 16.\" 4. Neither the name of the University nor the names of its contributors 17.\" may be used to endorse or promote products derived from this software 18.\" without specific prior written permission. 19.\" 20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 30.\" SUCH DAMAGE. 31.\" 32.\" @(#)regexp.3 8.1 (Berkeley) 6/4/93 33.\" 34.Dd June 4, 1993 35.Dt REGEXP 3 36.Os 37.Sh NAME 38.Nm regcomp , 39.Nm regexec , 40.Nm regsub , 41.Nm regerror 42.Nd regular expression handlers 43.Sh SYNOPSIS 44.Fd #include <regexp.h> 45.Ft regexp * 46.Fn regcomp "const char *exp" 47.Ft int 48.Fn regexec "const regexp *prog" "const char *string" 49.Ft void 50.Fn regsub "const regexp *prog" "const char *source" "char *dest" 51.Sh DESCRIPTION 52This interface is made obsolete by 53.Xr regex 3 . 54.Pp 55The 56.Fn regcomp , 57.Fn regexec , 58.Fn regsub , 59and 60.Fn regerror 61functions 62implement 63.Xr egrep 1 Ns -style 64regular expressions and supporting facilities. 65.Pp 66The 67.Fn regcomp 68function 69compiles a regular expression into a structure of type 70.Xr regexp , 71and returns a pointer to it. 72The space has been allocated using 73.Xr malloc 3 74and may be released by 75.Xr free . 76.Pp 77The 78.Fn regexec 79function 80matches a 81.Dv NUL Ns -terminated 82.Fa string 83against the compiled regular expression 84in 85.Fa prog . 86It returns 1 for success and 0 for failure, and adjusts the contents of 87.Fa prog Ns 's 88.Em startp 89and 90.Em endp 91(see below) accordingly. 92.Pp 93The members of a 94.Xr regexp 95structure include at least the following (not necessarily in order): 96.Bd -literal -offset indent 97char *startp[NSUBEXP]; 98char *endp[NSUBEXP]; 99.Ed 100.Pp 101where 102.Dv NSUBEXP 103is defined (as 10) in the header file. 104Once a successful 105.Fn regexec 106has been done using the 107.Fn regexp , 108each 109.Em startp Ns - Em endp 110pair describes one substring 111within the 112.Fa string , 113with the 114.Em startp 115pointing to the first character of the substring and 116the 117.Em endp 118pointing to the first character following the substring. 119The 0th substring is the substring of 120.Fa string 121that matched the whole 122regular expression. 123The others are those substrings that matched parenthesized expressions 124within the regular expression, with parenthesized expressions numbered 125in left-to-right order of their opening parentheses. 126.Pp 127The 128.Fn regsub 129function 130copies 131.Fa source 132to 133.Fa dest , 134making substitutions according to the 135most recent 136.Fn regexec 137performed using 138.Fa prog . 139Each instance of `&' in 140.Fa source 141is replaced by the substring 142indicated by 143.Em startp Ns Bq 144and 145.Em endp Ns Bq . 146Each instance of 147.Sq \e Ns Em n , 148where 149.Em n 150is a digit, is replaced by 151the substring indicated by 152.Em startp Ns Bq Em n 153and 154.Em endp Ns Bq Em n . 155To get a literal `&' or 156.Sq \e Ns Em n 157into 158.Fa dest , 159prefix it with `\e'; 160to get a literal `\e' preceding `&' or 161.Sq \e Ns Em n , 162prefix it with 163another `\e'. 164.Pp 165The 166.Fn regerror 167function 168is called whenever an error is detected in 169.Fn regcomp , 170.Fn regexec , 171or 172.Fn regsub . 173The default 174.Fn regerror 175writes the string 176.Fa msg , 177with a suitable indicator of origin, 178on the standard 179error output 180and invokes 181.Xr exit 2 . 182The 183.Fn regerror 184function 185can be replaced by the user if other actions are desirable. 186.Sh REGULAR EXPRESSION SYNTAX 187A regular expression is zero or more 188.Em branches , 189separated by `|'. 190It matches anything that matches one of the branches. 191.Pp 192A branch is zero or more 193.Em pieces , 194concatenated. 195It matches a match for the first, followed by a match for the second, etc. 196.Pp 197A piece is an 198.Em atom 199possibly followed by `*', `+', or `?'. 200An atom followed by `*' matches a sequence of 0 or more matches of the atom. 201An atom followed by `+' matches a sequence of 1 or more matches of the atom. 202An atom followed by `?' matches a match of the atom, or the null string. 203.Pp 204An atom is a regular expression in parentheses (matching a match for the 205regular expression), a 206.Em range 207(see below), `.' 208(matching any single character), `^' (matching the null string at the 209beginning of the input string), `$' (matching the null string at the 210end of the input string), a `\e' followed by a single character (matching 211that character), or a single character with no other significance 212(matching that character). 213.Pp 214A 215.Em range 216is a sequence of characters enclosed in `[]'. 217It normally matches any single character from the sequence. 218If the sequence begins with `^', 219it matches any single character 220.Em not 221from the rest of the sequence. 222If two characters in the sequence are separated by `\-', this is shorthand 223for the full list of 224.Tn ASCII 225characters between them 226(e.g. `[0-9]' matches any decimal digit). 227To include a literal `]' in the sequence, make it the first character 228(following a possible `^'). 229To include a literal `\-', make it the first or last character. 230.Sh AMBIGUITY 231If a regular expression could match two different parts of the input string, 232it will match the one which begins earliest. 233If both begin in the same place but match different lengths, or match 234the same length in different ways, life gets messier, as follows. 235.Pp 236In general, the possibilities in a list of branches are considered in 237left-to-right order, the possibilities for `*', `+', and `?' are 238considered longest-first, nested constructs are considered from the 239outermost in, and concatenated constructs are considered leftmost-first. 240The match that will be chosen is the one that uses the earliest 241possibility in the first choice that has to be made. 242If there is more than one choice, the next will be made in the same manner 243(earliest possibility) subject to the decision on the first choice. 244And so forth. 245.Pp 246For example, 247.Sq Li (ab|a)b*c 248could match 249`abc' in one of two ways. 250The first choice is between `ab' and `a'; since `ab' is earlier, and does 251lead to a successful overall match, it is chosen. 252Since the `b' is already spoken for, 253the `b*' must match its last possibility\(emthe empty string\(emsince 254it must respect the earlier choice. 255.Pp 256In the particular case where no `|'s are present and there is only one 257`*', `+', or `?', the net effect is that the longest possible 258match will be chosen. 259So 260.Sq Li ab* , 261presented with `xabbbby', will match `abbbb'. 262Note that if 263.Sq Li ab* , 264is tried against `xabyabbbz', it 265will match `ab' just after `x', due to the begins-earliest rule. 266(In effect, the decision on where to start the match is the first choice 267to be made, hence subsequent choices must respect it even if this leads them 268to less-preferred alternatives.) 269.Sh RETURN VALUES 270The 271.Fn regcomp 272function 273returns 274.Dv NULL 275for a failure 276.Pf ( Fn regerror 277permitting), 278where failures are syntax errors, exceeding implementation limits, 279or applying `+' or `*' to a possibly-null operand. 280.Sh SEE ALSO 281.Xr ed 1 , 282.Xr ex 1 , 283.Xr expr 1 , 284.Xr egrep 1 , 285.Xr fgrep 1 , 286.Xr grep 1 , 287.Xr regex 3 288.Sh HISTORY 289Both code and manual page for 290.Fn regcomp , 291.Fn regexec , 292.Fn regsub , 293and 294.Fn regerror 295were written at the University of Toronto 296and appeared in 297.Bx 4.3 tahoe . 298They are intended to be compatible with the Bell V8 299.Xr regexp 3 , 300but are not derived from Bell code. 301.Sh BUGS 302Empty branches and empty regular expressions are not portable to V8. 303.Pp 304The restriction against 305applying `*' or `+' to a possibly-null operand is an artifact of the 306simplistic implementation. 307.Pp 308Does not support 309.Xr egrep Ns 's 310newline-separated branches; 311neither does the V8 312.Xr regexp 3 , 313though. 314.Pp 315Due to emphasis on 316compactness and simplicity, 317it's not strikingly fast. 318It does give special attention to handling simple cases quickly. 319