1*5f2eab64SJohn MarinoIntroduction 2*5f2eab64SJohn Marino 3*5f2eab64SJohn Marino TRE is a lightweight, robust, and efficient POSIX compliant regexp 4*5f2eab64SJohn Marino matching library with some exciting features such as approximate 5*5f2eab64SJohn Marino (fuzzy) matching. 6*5f2eab64SJohn Marino 7*5f2eab64SJohn Marino The matching algorithm used in TRE uses linear worst-case time in 8*5f2eab64SJohn Marino the length of the text being searched, and quadratic worst-case 9*5f2eab64SJohn Marino time in the length of the used regular expression. In other words, 10*5f2eab64SJohn Marino the time complexity of the algorithm is O(M^2N), where M is the 11*5f2eab64SJohn Marino length of the regular expression and N is the length of the 12*5f2eab64SJohn Marino text. The used space is also quadratic on the length of the regex, 13*5f2eab64SJohn Marino but does not depend on the searched string. This quadratic 14*5f2eab64SJohn Marino behaviour occurs only on pathological cases which are probably very 15*5f2eab64SJohn Marino rare in practice. 16*5f2eab64SJohn Marino 17*5f2eab64SJohn MarinoFeatures 18*5f2eab64SJohn Marino 19*5f2eab64SJohn Marino TRE is not just yet another regexp matcher. TRE has some features 20*5f2eab64SJohn Marino which are not there in most free POSIX compatible 21*5f2eab64SJohn Marino implementations. Most of these features are not present in non-free 22*5f2eab64SJohn Marino implementations either, for that matter. 23*5f2eab64SJohn Marino 24*5f2eab64SJohn MarinoApproximate matching 25*5f2eab64SJohn Marino 26*5f2eab64SJohn Marino Approximate pattern matching allows matches to be approximate, that 27*5f2eab64SJohn Marino is, allows the matches to be close to the searched pattern under 28*5f2eab64SJohn Marino some measure of closeness. TRE uses the edit-distance measure (also 29*5f2eab64SJohn Marino known as the Levenshtein distance) where characters can be 30*5f2eab64SJohn Marino inserted, deleted, or substituted in the searched text in order to 31*5f2eab64SJohn Marino get an exact match. Each insertion, deletion, or substitution adds 32*5f2eab64SJohn Marino the distance, or cost, of the match. TRE can report the matches 33*5f2eab64SJohn Marino which have a cost lower than some given threshold value. TRE can 34*5f2eab64SJohn Marino also be used to search for matches with the lowest cost. 35*5f2eab64SJohn Marino 36*5f2eab64SJohn Marino TRE includes a version of the agrep (approximate grep) command line 37*5f2eab64SJohn Marino tool for approximate regexp matching in the style of grep. Unlike 38*5f2eab64SJohn Marino other agrep implementations (like the one by Sun Wu and Udi Manber 39*5f2eab64SJohn Marino from University of Arizona available here) TRE agrep allows full 40*5f2eab64SJohn Marino regexps of any length, any number of errors, and non-uniform costs 41*5f2eab64SJohn Marino for insertion, deletion and substitution. 42*5f2eab64SJohn Marino 43*5f2eab64SJohn MarinoStrict standard conformance 44*5f2eab64SJohn Marino 45*5f2eab64SJohn Marino POSIX defines the behaviour of regexp functions precisely. TRE 46*5f2eab64SJohn Marino attempts to conform to these specifications as strictly as 47*5f2eab64SJohn Marino possible. TRE always returns the correct matches for subpatterns, 48*5f2eab64SJohn Marino for example. Very few other implementations do this correctly. In 49*5f2eab64SJohn Marino fact, the only other implementations besides TRE that I am aware of 50*5f2eab64SJohn Marino (free or not) that get it right are Rx by Tom Lord, Regex++ by John 51*5f2eab64SJohn Marino Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy. 52*5f2eab64SJohn Marino 53*5f2eab64SJohn Marino The standard TRE tries to conform to is the IEEE Std 1003.1-2001, 54*5f2eab64SJohn Marino or Open Group Base Specifications Issue 6, commonly referred to as 55*5f2eab64SJohn Marino "POSIX". It can be found online here. The relevant parts are the 56*5f2eab64SJohn Marino base specifications on regular expressions (and the rationale) and 57*5f2eab64SJohn Marino the description of the regcomp() API. 58*5f2eab64SJohn Marino 59*5f2eab64SJohn Marino For an excellent survey on POSIX regexp matchers, see the testregex 60*5f2eab64SJohn Marino pages by Glenn Fowler of AT&T Labs Research. 61*5f2eab64SJohn Marino 62*5f2eab64SJohn MarinoPredictable matching speed 63*5f2eab64SJohn Marino 64*5f2eab64SJohn Marino Because of the matching algorithm used in TRE, the maximum time 65*5f2eab64SJohn Marino consumed by any regexec() call is always directly proportional to 66*5f2eab64SJohn Marino the length of the searched string. There is one exception: if back 67*5f2eab64SJohn Marino references are used, the matching may take time that grows 68*5f2eab64SJohn Marino exponentially with the length of the string. This is because 69*5f2eab64SJohn Marino matching back references is an NP complete problem, and almost 70*5f2eab64SJohn Marino certainly requires exponential time to match in the worst case. 71*5f2eab64SJohn Marino 72*5f2eab64SJohn MarinoPredictable and modest memory consumption 73*5f2eab64SJohn Marino 74*5f2eab64SJohn Marino A regexec() call never allocates memory from the heap. TRE 75*5f2eab64SJohn Marino allocates all the memory it needs during a regcomp() call, and some 76*5f2eab64SJohn Marino temporary working space from the stack frame for the duration of 77*5f2eab64SJohn Marino the regexec() call. The amount of temporary space needed is 78*5f2eab64SJohn Marino constant during matching and does not depend on the searched 79*5f2eab64SJohn Marino string. For regexps of reasonable size TRE needs less than 50K of 80*5f2eab64SJohn Marino dynamically allocated memory during the regcomp() call, less than 81*5f2eab64SJohn Marino 20K for the compiled pattern buffer, and less than two kilobytes of 82*5f2eab64SJohn Marino temporary working space from the stack frame during a regexec() 83*5f2eab64SJohn Marino call. There is no time/memory tradeoff. TRE is also small in code 84*5f2eab64SJohn Marino size; statically linking with TRE increases the executable size 85*5f2eab64SJohn Marino less than 30K (gcc-3.2, x86, GNU/Linux). 86*5f2eab64SJohn Marino 87*5f2eab64SJohn MarinoWide character and multibyte character set support 88*5f2eab64SJohn Marino 89*5f2eab64SJohn Marino TRE supports multibyte character sets. This makes it possible to 90*5f2eab64SJohn Marino use regexps seamlessly with, for example, Japanese locales. TRE 91*5f2eab64SJohn Marino also provides a wide character API. 92*5f2eab64SJohn Marino 93*5f2eab64SJohn MarinoBinary pattern and data support 94*5f2eab64SJohn Marino 95*5f2eab64SJohn Marino TRE provides APIs which allow binary zero characters both in 96*5f2eab64SJohn Marino regexps and searched strings. The standard API cannot be easily 97*5f2eab64SJohn Marino used to, for example, search for printable words from binary data 98*5f2eab64SJohn Marino (although it is possible with some hacking). Searching for patterns 99*5f2eab64SJohn Marino which contain binary zeroes embedded is not possible at all with 100*5f2eab64SJohn Marino the standard API. 101*5f2eab64SJohn Marino 102*5f2eab64SJohn MarinoCompletely thread safe 103*5f2eab64SJohn Marino 104*5f2eab64SJohn Marino TRE is completely thread safe. All the exported functions are 105*5f2eab64SJohn Marino re-entrant, and a single compiled regexp object can be used 106*5f2eab64SJohn Marino simultaneously in multiple contexts; e.g. in main() and a signal 107*5f2eab64SJohn Marino handler, or in many threads of a multithreaded application. 108*5f2eab64SJohn Marino 109*5f2eab64SJohn MarinoPortable 110*5f2eab64SJohn Marino 111*5f2eab64SJohn Marino TRE is portable across multiple platforms. Here's a table of 112*5f2eab64SJohn Marino platforms and compilers that have been successfully used to compile 113*5f2eab64SJohn Marino and run TRE: 114*5f2eab64SJohn Marino 115*5f2eab64SJohn Marino Platform(s) | Compiler(s) 116*5f2eab64SJohn Marino ----------------------------------+------------ 117*5f2eab64SJohn Marino AIX 4.3.2 - 5.3.0 | GCC, C for AIX compiler version 5 118*5f2eab64SJohn Marino Compaq Tru64 UNIX V5.1A/B | Compaq C V6.4-014 - V6.5-011 119*5f2eab64SJohn Marino Cygwin 1.3 - 1.5 | GCC 120*5f2eab64SJohn Marino Digital UNIX V4.0 | DEC C V5.9-005 121*5f2eab64SJohn Marino FreeBSD 4 and above | GCC 122*5f2eab64SJohn Marino GNU/Linux systems on x86, x86_64, | GCC 123*5f2eab64SJohn Marino ppc64, s390 | 124*5f2eab64SJohn Marino HP-UX 10.20- 11.00 | GCC, HP C Compiler 125*5f2eab64SJohn Marino IRIX 6.5 | GCC, MIPSpro Compilers 7.3.1.3m 126*5f2eab64SJohn Marino Max OS X | 127*5f2eab64SJohn Marino NetBSD 1.5 and above | GCC, egcs 128*5f2eab64SJohn Marino OpenBSD 3.3 and above | GCC 129*5f2eab64SJohn Marino Solaris 2.7-10 sparc/x86 | GCC, Sun Workshop 6 compilers 130*5f2eab64SJohn Marino Windows 98 - XP | Microsoft Visual C++ 6.0 131*5f2eab64SJohn Marino 132*5f2eab64SJohn Marino TRE 0.7.5 should compile without changes on all of the above 133*5f2eab64SJohn Marino platforms. Tell me if you are using TRE on a platform that is not 134*5f2eab64SJohn Marino listed above, and I'll add it to the list. Also let me know if TRE 135*5f2eab64SJohn Marino does not work on a listed platform. 136*5f2eab64SJohn Marino 137*5f2eab64SJohn Marino Depending on the platform, you may need to install libutf8 to get 138*5f2eab64SJohn Marino wide character and multibyte character set support. 139*5f2eab64SJohn Marino 140*5f2eab64SJohn Marino Free 141*5f2eab64SJohn Marino 142*5f2eab64SJohn Marino TRE is released under a license which is essentially the same as 143*5f2eab64SJohn Marino the "2 clause" BSD-style license used in NetBSD. See the file 144*5f2eab64SJohn Marino LICENSE for details. 145*5f2eab64SJohn Marino 146*5f2eab64SJohn MarinoRoadmap 147*5f2eab64SJohn Marino 148*5f2eab64SJohn Marino There are currently two features, both related to collating 149*5f2eab64SJohn Marino elements, missing from 100% POSIX compliance. These are: 150*5f2eab64SJohn Marino 151*5f2eab64SJohn Marino * Support for collating elements (e.g. [[.<X>.]], where <X> is a 152*5f2eab64SJohn Marino collating element). It is not possible to support 153*5f2eab64SJohn Marino multi-character collating elements portably, since POSIX does 154*5f2eab64SJohn Marino not define a way to determine whether a character sequence is a 155*5f2eab64SJohn Marino multi-character collating element or not. 156*5f2eab64SJohn Marino 157*5f2eab64SJohn Marino * Support for equivalence classes, for example [[=<X>=]], where 158*5f2eab64SJohn Marino <X> is a collating element. An equivalence class matches any 159*5f2eab64SJohn Marino character which has the same primary collation weight as 160*5f2eab64SJohn Marino <X>. Again, POSIX provides no portable mechanism for 161*5f2eab64SJohn Marino determining the primary collation weight of a collating 162*5f2eab64SJohn Marino element. 163*5f2eab64SJohn Marino 164*5f2eab64SJohn Marino Note that other portable regexp implementations don't support 165*5f2eab64SJohn Marino collating elements either. The single exception is Regex++, which 166*5f2eab64SJohn Marino comes with its own database for collating elements for different 167*5f2eab64SJohn Marino locales. Support for collating elements and equivalence classes has 168*5f2eab64SJohn Marino not been widely requested and is not very high on the TODO list at 169*5f2eab64SJohn Marino the moment. 170*5f2eab64SJohn Marino 171*5f2eab64SJohn Marino These are other features I'm planning to implement real soon now: 172*5f2eab64SJohn Marino 173*5f2eab64SJohn Marino * All the missing GNU extensions enabled in GNU regex, such as 174*5f2eab64SJohn Marino [[:<:]] and [[:>:]] 175*5f2eab64SJohn Marino 176*5f2eab64SJohn Marino * A REG_SHORTEST regexec() flag for returning the shortest match 177*5f2eab64SJohn Marino instead of the longest match. 178*5f2eab64SJohn Marino 179*5f2eab64SJohn Marino * Perl-compatible syntax 180*5f2eab64SJohn Marino 181*5f2eab64SJohn Marino [:^class:] 182*5f2eab64SJohn Marino Matches anything but the characters in class. Note that 183*5f2eab64SJohn Marino [^[:class:]] works already, this would be just a 184*5f2eab64SJohn Marino convenience shorthand. 185*5f2eab64SJohn Marino 186*5f2eab64SJohn Marino \A 187*5f2eab64SJohn Marino Match only at beginning of string 188*5f2eab64SJohn Marino 189*5f2eab64SJohn Marino \Z 190*5f2eab64SJohn Marino Match only at end of string, or before newline at the end 191*5f2eab64SJohn Marino 192*5f2eab64SJohn Marino \z 193*5f2eab64SJohn Marino Match only at end of string 194*5f2eab64SJohn Marino 195*5f2eab64SJohn Marino \l 196*5f2eab64SJohn Marino Lowercase next char (think vi) 197*5f2eab64SJohn Marino 198*5f2eab64SJohn Marino \u 199*5f2eab64SJohn Marino Uppercase next char (think vi) 200*5f2eab64SJohn Marino 201*5f2eab64SJohn Marino \L 202*5f2eab64SJohn Marino Lowercase till \E (think vi) 203*5f2eab64SJohn Marino 204*5f2eab64SJohn Marino \U 205*5f2eab64SJohn Marino Uppercase till \E (think vi) 206*5f2eab64SJohn Marino 207*5f2eab64SJohn Marino (?=pattern) 208*5f2eab64SJohn Marino Zero-width positive look-ahead assertions. 209*5f2eab64SJohn Marino 210*5f2eab64SJohn Marino (?!pattern) 211*5f2eab64SJohn Marino Zero-width negative look-ahead assertions. 212*5f2eab64SJohn Marino 213*5f2eab64SJohn Marino (?<=pattern) 214*5f2eab64SJohn Marino Zero-width positive look-behind assertions. 215*5f2eab64SJohn Marino 216*5f2eab64SJohn Marino (?<!pattern) 217*5f2eab64SJohn Marino Zero-width negative look-behind assertions. 218*5f2eab64SJohn Marino 219*5f2eab64SJohn Marino Documentation especially for the nonstandard features of TRE, such 220*5f2eab64SJohn Marino as approximate matching, is a work in progress (with "progress" 221*5f2eab64SJohn Marino loosely defined...) 222*5f2eab64SJohn Marino 223*5f2eab64SJohn Marino Mailing lists 224*5f2eab64SJohn Marino 225*5f2eab64SJohn Marino tre-general@lists.laurikari.net 226*5f2eab64SJohn Marino This list is for any discussion on the TRE software, including 227*5f2eab64SJohn Marino reporting bugs, feature requests, requests for help, and other 228*5f2eab64SJohn Marino things. 229*5f2eab64SJohn Marino 230*5f2eab64SJohn Marino tre-announce@lists.laurikari.net 231*5f2eab64SJohn Marino Subscribe to this list to get announcements of new releases of 232*5f2eab64SJohn Marino TRE. Alternatively, you can subscribe to the freshmeat.net 233*5f2eab64SJohn Marino project and get similar announcements that way, if you prefer. 234*5f2eab64SJohn Marino 235*5f2eab64SJohn MarinoVille Laurikari <vl@iki.fi> 236