1Introduction 2============ 3 4TRE is a lightweight, robust, and efficient POSIX compliant regexp 5matching library with some exciting features such as approximate 6(fuzzy) matching. 7 8The matching algorithm used in TRE uses linear worst-case time in 9the length of the text being searched, and quadratic worst-case 10time in the length of the used regular expression. 11 12In other words, the time complexity of the algorithm is O(M^2N), where 13M is the length of the regular expression and N is the length of the 14text. The used space is also quadratic on the length of the regex, but 15does not depend on the searched string. This quadratic behaviour 16occurs only on pathological cases which are probably very rare in 17practice. 18 19 20Hacking 21======= 22 23Here's how to work with this code. 24 25Prerequisites 26------------- 27 28You will need the following tools installed on your system: 29 30 - autoconf 31 - automake 32 - gettext 33 - libtool 34 - zip (optional) 35 36 37Building 38-------- 39 40First, prepare the tre. Change to the root of the source directory 41and run 42``` 43./utils/autogen.sh 44``` 45This will regenerate various things using the prerequisite tools so 46that you end up with a buildable tree. 47 48After this, you can run the configure script and build TRE as usual: 49``` 50./configure 51make 52make check 53make install 54``` 55 56 57Building a source code package 58------------------------------ 59 60In a prepared tree, this command creates a source code tarball: 61``` 62./configure && make dist 63``` 64 65Alternatively, you can run 66``` 67./utils/build-sources.sh 68``` 69which builds the source code packages and puts them in the `dist` 70subdirectory. This script needs a working `zip` command. 71 72 73Features 74======== 75 76TRE is not just yet another regexp matcher. TRE has some features 77which are not there in most free POSIX compatible implementations. 78Most of these features are not present in non-free implementations 79either, for that matter. 80 81Approximate matching 82-------------------- 83 84Approximate pattern matching allows matches to be approximate, that 85is, allows the matches to be close to the searched pattern under 86some measure of closeness. TRE uses the edit-distance measure (also 87known as the Levenshtein distance) where characters can be 88inserted, deleted, or substituted in the searched text in order to 89get an exact match. 90 91Each insertion, deletion, or substitution adds the distance, or cost, 92of the match. TRE can report the matches which have a cost lower than 93some given threshold value. TRE can also be used to search for matches 94with the lowest cost. 95 96TRE includes a version of the agrep (approximate grep) command line 97tool for approximate regexp matching in the style of grep. Unlike 98other agrep implementations (like the one by Sun Wu and Udi Manber 99from University of Arizona) TRE agrep allows full regexps of any 100length, any number of errors, and non-uniform costs for insertion, 101deletion and substitution. 102 103Strict standard conformance 104--------------------------- 105 106POSIX defines the behaviour of regexp functions precisely. TRE 107attempts to conform to these specifications as strictly as possible. 108TRE always returns the correct matches for subpatterns, for example. 109Very few other implementations do this correctly. In fact, the only 110other implementations besides TRE that I am aware of (free or not) 111that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the 112AT&T ast regex by Glenn Fowler and Doug McIlroy. 113 114The standard TRE tries to conform to is the IEEE Std 1003.1-2001, 115or Open Group Base Specifications Issue 6, commonly referred to as 116"POSIX". It can be found online here. The relevant parts are the 117base specifications on regular expressions (and the rationale) and 118the description of the regcomp() API. 119 120For an excellent survey on POSIX regexp matchers, see the testregex 121pages by Glenn Fowler of AT&T Labs Research. 122 123Predictable matching speed 124-------------------------- 125 126Because of the matching algorithm used in TRE, the maximum time 127consumed by any regexec() call is always directly proportional to 128the length of the searched string. There is one exception: if back 129references are used, the matching may take time that grows 130exponentially with the length of the string. This is because 131matching back references is an NP complete problem, and almost 132certainly requires exponential time to match in the worst case. 133 134Predictable and modest memory consumption 135----------------------------------------- 136 137A regexec() call never allocates memory from the heap. TRE 138allocates all the memory it needs during a regcomp() call, and some 139temporary working space from the stack frame for the duration of 140the regexec() call. The amount of temporary space needed is 141constant during matching and does not depend on the searched 142string. For regexps of reasonable size TRE needs less than 50K of 143dynamically allocated memory during the regcomp() call, less than 14420K for the compiled pattern buffer, and less than two kilobytes of 145temporary working space from the stack frame during a regexec() 146call. There is no time/memory tradeoff. TRE is also small in code 147size; statically linking with TRE increases the executable size 148less than 30K (gcc-3.2, x86, GNU/Linux). 149 150Wide character and multibyte character set support 151-------------------------------------------------- 152 153TRE supports multibyte character sets. This makes it possible to 154use regexps seamlessly with, for example, Japanese locales. TRE 155also provides a wide character API. 156 157Binary pattern and data support 158------------------------------- 159 160TRE provides APIs which allow binary zero characters both in 161regexps and searched strings. The standard API cannot be easily 162used to, for example, search for printable words from binary data 163(although it is possible with some hacking). Searching for patterns 164which contain binary zeroes embedded is not possible at all with 165the standard API. 166 167Completely thread safe 168---------------------- 169 170TRE is completely thread safe. All the exported functions are 171re-entrant, and a single compiled regexp object can be used 172simultaneously in multiple contexts; e.g. in main() and a signal 173handler, or in many threads of a multithreaded application. 174 175Portable 176-------- 177 178TRE is portable across multiple platforms. Here's a table of 179platforms and compilers that have been successfully used to compile 180and run TRE: 181 182<table> 183 <tr><th>Platform(s)</th> <th>Compiler(s)</th></tr> 184 <tr><td>AIX 4.3.2 - 5.3.0</td> <td>GCC, C for AIX compiler version 5</td></tr> 185 <tr><td>Compaq Tru64 UNIX V5.1A/B</td> <td>Compaq C V6.4-014 - V6.5-011</td></tr> 186 <tr><td>Cygwin 1.3 - 1.5</td> <td>GCC</td></tr> 187 <tr><td>Digital UNIX V4.0</td> <td>DEC C V5.9-005</td></tr> 188 <tr><td>FreeBSD 4 and above</td> <td>GCC</td></tr> 189 <tr><td>GNU/Linux systems on x86, x86_64, ppc64, s390</td><td>GCC</td></tr> 190 <tr><td>HP-UX 10.20- 11.00</td> <td>GCC, HP C Compiler</td></tr> 191 <tr><td>IRIX 6.5</td> <td>GCC, MIPSpro Compilers 7.3.1.3m</td></tr> 192 <tr><td>Max OS X</td></tr> 193 <tr><td>NetBSD 1.5 and above</td> <td>GCC, egcs</td></tr> 194 <tr><td>OpenBSD 3.3 and above</td> <td>GCC</td></tr> 195 <tr><td>Solaris 2.7-10 sparc/x86</td> <td>GCC, Sun Workshop 6 compilers</td></tr> 196 <tr><td>Windows 98 - XP</td> <td>Microsoft Visual C++ 6.0</td></tr> 197</table> 198 199 200TRE 0.7.5 should compile without changes on all of the above 201platforms. Tell me if you are using TRE on a platform that is not 202listed above, and I'll add it to the list. Also let me know if TRE 203does not work on a listed platform. 204 205Depending on the platform, you may need to install libutf8 to get 206wide character and multibyte character set support. 207 208Free 209---- 210 211TRE is released under a license which is essentially the same as 212the "2 clause" BSD-style license used in NetBSD. See the file 213LICENSE for details. 214 215Roadmap 216------- 217 218There are currently two features, both related to collating 219elements, missing from 100% POSIX compliance. These are: 220 221* Support for collating elements (e.g. [[.<X>.]], where <X> is a 222 collating element). It is not possible to support 223 multi-character collating elements portably, since POSIX does 224 not define a way to determine whether a character sequence is a 225 multi-character collating element or not. 226 227* Support for equivalence classes, for example [[=<X>=]], where 228 <X> is a collating element. An equivalence class matches any 229 character which has the same primary collation weight as 230 <X>. Again, POSIX provides no portable mechanism for 231 determining the primary collation weight of a collating 232 element. 233 234Note that other portable regexp implementations don't support 235collating elements either. The single exception is Regex++, which 236comes with its own database for collating elements for different 237locales. Support for collating elements and equivalence classes has 238not been widely requested and is not very high on the TODO list at 239the moment. 240 241These are other features I'm planning to implement real soon now: 242 243* All the missing GNU extensions enabled in GNU regex, such as 244 [[:<:]] and [[:>:]] 245 246* A REG_SHORTEST regexec() flag for returning the shortest match 247 instead of the longest match. 248 249* Perl-compatible syntax 250 * `[:^class:]` 251 * Matches anything but the characters in class. Note that 252 * [^[:class:]] works already, this would be just a 253 * convenience shorthand. 254 * 255 * `\A` 256 * Match only at beginning of string 257 * 258 * `\Z` 259 * Match only at end of string, or before newline at the end 260 * 261 * `\z` 262 * Match only at end of string 263 * 264 * `\l` 265 * Lowercase next char (think vi) 266 * 267 * `\u` 268 * Uppercase next char (think vi) 269 * 270 * `\L` 271 * Lowercase till \E (think vi) 272 * 273 * `\U` 274 * Uppercase till \E (think vi) 275 * 276 * `(?=pattern)` 277 * Zero-width positive look-ahead assertions. 278 * 279 * `(?!pattern)` 280 * Zero-width negative look-ahead assertions. 281 * 282 * `(?<=pattern)` 283 * Zero-width positive look-behind assertions. 284 * 285 * `(?<!pattern)` 286 * Zero-width negative look-behind assertions. 287 288Documentation especially for the nonstandard features of TRE, such 289as approximate matching, is a work in progress (with "progress" 290loosely defined...) 291