1*48375Scael.\" Copyright 1991 The Regents of the University of California. 2*48375Scael.\" All rights reserved. 3*48375Scael.\" 4*48375Scael.\" %sccs.include.redist.man% 5*48375Scael.\" 6*48375Scael.\" @(#)regexp.3 5.2 (Berkeley) 04/20/91 7*48375Scael.\" 8*48375Scael.Dd 9*48375Scael.Dt REGEXP 3 10*48375Scael.Os 11*48375Scael.Sh NAME 12*48375Scael.Nm regcomp , 13*48375Scael.Nm regexec , 14*48375Scael.Nm regsub , 15*48375Scael.Nm regerror 16*48375Scael.Nd regular expression handlers 17*48375Scael.Sh SYNOPSIS 18*48375Scael.Fd #include <regexp.h> 19*48375Scael.Ft regexp * 20*48375Scael.Fn regcomp "const char *exp" 21*48375Scael.Ft int 22*48375Scael.Fn regexec "const regexp *prog" "const char *string" 23*48375Scael.Ft void 24*48375Scael.Fn regsub "const regexp *prog" "const char *source" "char *dest" 25*48375Scael.Sh DESCRIPTION 26*48375ScaelThe 27*48375Scael.Fn regcomp , 28*48375Scael.Fn regexec , 29*48375Scael.Fn regsub , 30*48375Scaeland 31*48375Scael.Fn regerror 32*48375Scaelfunctions 33*48375Scaelimplement 34*48375Scael.Xr egrep 1 Ns -style 3541347Sbosticregular expressions and supporting facilities. 36*48375Scael.Pp 37*48375ScaelThe 38*48375Scael.Fn regcomp 39*48375Scaelfunction 4041347Sbosticcompiles a regular expression into a structure of type 41*48375Scael.Xr regexp , 4241347Sbosticand returns a pointer to it. 4341347SbosticThe space has been allocated using 44*48375Scael.Xr malloc 3 4541347Sbosticand may be released by 46*48375Scael.Xr free . 47*48375Scael.Pp 48*48375ScaelThe 49*48375Scael.Fn regexec 50*48375Scaelfunction 51*48375Scaelmatches a 52*48375Scael.Dv NUL Ns -terminated 53*48375Scael.Fa string 54*48375Scaelagainst the compiled regular expression 55*48375Scaelin 56*48375Scael.Fa prog . 5741347SbosticIt returns 1 for success and 0 for failure, and adjusts the contents of 58*48375Scael.Fa prog Ns 's 59*48375Scael.Em startp 60*48375Scaeland 61*48375Scael.Em endp 62*48375Scael(see below) accordingly. 63*48375Scael.Pp 6441347SbosticThe members of a 65*48375Scael.Xr regexp 6641347Sbosticstructure include at least the following (not necessarily in order): 67*48375Scael.Bd -literal -offset indent 6841347Sbosticchar *startp[NSUBEXP]; 6941347Sbosticchar *endp[NSUBEXP]; 70*48375Scael.Ed 71*48375Scael.Pp 7241347Sbosticwhere 73*48375Scael.Dv NSUBEXP 7441347Sbosticis defined (as 10) in the header file. 75*48375ScaelOnce a successful 76*48375Scael.Fn regexec 77*48375Scaelhas been done using the 78*48375Scael.Fn regexp , 79*48375Scaeleach 80*48375Scael.Em startp Ns - Em endp 81*48375Scaelpair describes one substring 82*48375Scaelwithin the 83*48375Scael.Fa string , 84*48375Scaelwith the 85*48375Scael.Em startp 86*48375Scaelpointing to the first character of the substring and 87*48375Scaelthe 88*48375Scael.Em endp 89*48375Scaelpointing to the first character following the substring. 90*48375ScaelThe 0th substring is the substring of 91*48375Scael.Fa string 92*48375Scaelthat matched the whole 9341347Sbosticregular expression. 9441347SbosticThe others are those substrings that matched parenthesized expressions 9541347Sbosticwithin the regular expression, with parenthesized expressions numbered 9641347Sbosticin left-to-right order of their opening parentheses. 97*48375Scael.Pp 98*48375ScaelThe 99*48375Scael.Fn regsub 100*48375Scaelfunction 101*48375Scaelcopies 102*48375Scael.Fa source 103*48375Scaelto 104*48375Scael.Fa dest , 105*48375Scaelmaking substitutions according to the 106*48375Scaelmost recent 107*48375Scael.Fn regexec 108*48375Scaelperformed using 109*48375Scael.Fa prog . 110*48375ScaelEach instance of `&' in 111*48375Scael.Fa source 112*48375Scaelis replaced by the substring 113*48375Scaelindicated by 114*48375Scael.Em startp Ns Bq 115*48375Scaeland 116*48375Scael.Em endp Ns Bq . 117*48375ScaelEach instance of 118*48375Scael.Sq \e Ns Em n , 119*48375Scaelwhere 120*48375Scael.Em n 121*48375Scaelis a digit, is replaced by 12241347Sbosticthe substring indicated by 123*48375Scael.Em startp Ns Bq Em n 124*48375Scaeland 125*48375Scael.Em endp Ns Bq Em n . 126*48375ScaelTo get a literal `&' or 127*48375Scael.Sq \e Ns Em n 128*48375Scaelinto 129*48375Scael.Fa dest , 130*48375Scaelprefix it with `\e'; 131*48375Scaelto get a literal `\e' preceding `&' or 132*48375Scael.Sq \e Ns Em n , 133*48375Scaelprefix it with 13441347Sbosticanother `\e'. 135*48375Scael.Pp 136*48375ScaelThe 137*48375Scael.Fn regerror 138*48375Scaelfunction 139*48375Scaelis called whenever an error is detected in 140*48375Scael.Fn regcomp , 141*48375Scael.Fn regexec , 142*48375Scaelor 143*48375Scael.Fn regsub . 144*48375ScaelThe default 145*48375Scael.Fn regerror 146*48375Scaelwrites the string 147*48375Scael.Fa msg , 14841347Sbosticwith a suitable indicator of origin, 14941347Sbosticon the standard 15041347Sbosticerror output 151*48375Scaeland invokes 152*48375Scael.Xr exit 2 . 153*48375ScaelThe 154*48375Scael.Fn regerror 155*48375Scaelfunction 15641347Sbosticcan be replaced by the user if other actions are desirable. 157*48375Scael.Sh REGULAR EXPRESSION SYNTAX 158*48375ScaelA regular expression is zero or more 159*48375Scael.Em branches , 160*48375Scaelseparated by `|'. 16141347SbosticIt matches anything that matches one of the branches. 162*48375Scael.Pp 163*48375ScaelA branch is zero or more 164*48375Scael.Em pieces , 165*48375Scaelconcatenated. 16641347SbosticIt matches a match for the first, followed by a match for the second, etc. 167*48375Scael.Pp 168*48375ScaelA piece is an 169*48375Scael.Em atom 170*48375Scaelpossibly followed by `*', `+', or `?'. 17141347SbosticAn atom followed by `*' matches a sequence of 0 or more matches of the atom. 17241347SbosticAn atom followed by `+' matches a sequence of 1 or more matches of the atom. 17341347SbosticAn atom followed by `?' matches a match of the atom, or the null string. 174*48375Scael.Pp 17541347SbosticAn atom is a regular expression in parentheses (matching a match for the 176*48375Scaelregular expression), a 177*48375Scael.Em range 178*48375Scael(see below), `.' 17941347Sbostic(matching any single character), `^' (matching the null string at the 18041347Sbosticbeginning of the input string), `$' (matching the null string at the 18141347Sbosticend of the input string), a `\e' followed by a single character (matching 18241347Sbosticthat character), or a single character with no other significance 18341347Sbostic(matching that character). 184*48375Scael.Pp 185*48375ScaelA 186*48375Scael.Em range 187*48375Scaelis a sequence of characters enclosed in `[]'. 18841347SbosticIt normally matches any single character from the sequence. 18941347SbosticIf the sequence begins with `^', 190*48375Scaelit matches any single character 191*48375Scael.Em not 192*48375Scaelfrom the rest of the sequence. 19341347SbosticIf two characters in the sequence are separated by `\-', this is shorthand 194*48375Scaelfor the full list of 195*48375Scael.Tn ASCII 196*48375Scaelcharacters between them 19741347Sbostic(e.g. `[0-9]' matches any decimal digit). 19841347SbosticTo include a literal `]' in the sequence, make it the first character 19941347Sbostic(following a possible `^'). 20041347SbosticTo include a literal `\-', make it the first or last character. 201*48375Scael.Sh AMBIGUITY 20241347SbosticIf a regular expression could match two different parts of the input string, 20341347Sbosticit will match the one which begins earliest. 204*48375ScaelIf both begin in the same place but match different lengths, or match 20541347Sbosticthe same length in different ways, life gets messier, as follows. 206*48375Scael.Pp 20741347SbosticIn general, the possibilities in a list of branches are considered in 20841347Sbosticleft-to-right order, the possibilities for `*', `+', and `?' are 20941347Sbosticconsidered longest-first, nested constructs are considered from the 21041347Sbosticoutermost in, and concatenated constructs are considered leftmost-first. 21141347SbosticThe match that will be chosen is the one that uses the earliest 21241347Sbosticpossibility in the first choice that has to be made. 21341347SbosticIf there is more than one choice, the next will be made in the same manner 21441347Sbostic(earliest possibility) subject to the decision on the first choice. 21541347SbosticAnd so forth. 216*48375Scael.Pp 217*48375ScaelFor example, 218*48375Scael.Sq Li (ab|a)b*c 219*48375Scaelcould match 220*48375Scael`abc' in one of two ways. 22141347SbosticThe first choice is between `ab' and `a'; since `ab' is earlier, and does 22241347Sbosticlead to a successful overall match, it is chosen. 22341347SbosticSince the `b' is already spoken for, 22441347Sbosticthe `b*' must match its last possibility\(emthe empty string\(emsince 22541347Sbosticit must respect the earlier choice. 226*48375Scael.Pp 22741347SbosticIn the particular case where no `|'s are present and there is only one 22841347Sbostic`*', `+', or `?', the net effect is that the longest possible 22941347Sbosticmatch will be chosen. 230*48375ScaelSo 231*48375Scael.Sq Li ab* , 232*48375Scaelpresented with `xabbbby', will match `abbbb'. 233*48375ScaelNote that if 234*48375Scael.Sq Li ab* , 235*48375Scaelis tried against `xabyabbbz', it 23641347Sbosticwill match `ab' just after `x', due to the begins-earliest rule. 23741347Sbostic(In effect, the decision on where to start the match is the first choice 23841347Sbosticto be made, hence subsequent choices must respect it even if this leads them 23941347Sbosticto less-preferred alternatives.) 240*48375Scael.Sh RETURN VALUES 241*48375ScaelThe 242*48375Scael.Fn regcomp 243*48375Scaelfunction 244*48375Scaelreturns 245*48375Scael.Dv NULL 246*48375Scaelfor a failure 247*48375Scael.Pf ( Fn regerror 248*48375Scaelpermitting), 24941347Sbosticwhere failures are syntax errors, exceeding implementation limits, 25041347Sbosticor applying `+' or `*' to a possibly-null operand. 251*48375Scael.Sh SEE ALSO 252*48375Scael.Xr ed 1 , 253*48375Scael.Xr ex 1 , 254*48375Scael.Xr expr 1 , 255*48375Scael.Xr egrep 1 , 256*48375Scael.Xr fgrep 1 , 257*48375Scael.Xr grep 1 , 258*48375Scael.Xr regex 3 259*48375Scael.Sh HISTORY 260*48375ScaelBoth code and manual page for 261*48375Scael.Fn regcomp , 262*48375Scael.Fn regexec , 263*48375Scael.Fn regsub , 264*48375Scaeland 265*48375Scael.Fn regerror 266*48375Scaelwere written at the University of Toronto 267*48375Scaeland appeared in 268*48375Scael.Bx 4.3 tahoe . 269*48375ScaelThey are intended to be compatible with the Bell V8 270*48375Scael.Xr regexp 3 , 27141347Sbosticbut are not derived from Bell code. 272*48375Scael.Sh BUGS 27341347SbosticEmpty branches and empty regular expressions are not portable to V8. 274*48375Scael.Pp 27541347SbosticThe restriction against 27641347Sbosticapplying `*' or `+' to a possibly-null operand is an artifact of the 27741347Sbosticsimplistic implementation. 278*48375Scael.Pp 279*48375ScaelDoes not support 280*48375Scael.Xr egrep Ns 's 281*48375Scaelnewline-separated branches; 282*48375Scaelneither does the V8 283*48375Scael.Xr regexp 3 , 284*48375Scaelthough. 285*48375Scael.Pp 28641347SbosticDue to emphasis on 28741347Sbosticcompactness and simplicity, 28841347Sbosticit's not strikingly fast. 28941347SbosticIt does give special attention to handling simple cases quickly. 290