xref: /dflybsd-src/contrib/tre/README (revision 071cbfc5b674dbc05e198747dddf53a507e3290a)
1*5f2eab64SJohn MarinoIntroduction
2*5f2eab64SJohn Marino
3*5f2eab64SJohn Marino   TRE is a lightweight, robust, and efficient POSIX compliant regexp
4*5f2eab64SJohn Marino   matching library with some exciting features such as approximate
5*5f2eab64SJohn Marino   (fuzzy) matching.
6*5f2eab64SJohn Marino
7*5f2eab64SJohn Marino   The matching algorithm used in TRE uses linear worst-case time in
8*5f2eab64SJohn Marino   the length of the text being searched, and quadratic worst-case
9*5f2eab64SJohn Marino   time in the length of the used regular expression. In other words,
10*5f2eab64SJohn Marino   the time complexity of the algorithm is O(M^2N), where M is the
11*5f2eab64SJohn Marino   length of the regular expression and N is the length of the
12*5f2eab64SJohn Marino   text. The used space is also quadratic on the length of the regex,
13*5f2eab64SJohn Marino   but does not depend on the searched string. This quadratic
14*5f2eab64SJohn Marino   behaviour occurs only on pathological cases which are probably very
15*5f2eab64SJohn Marino   rare in practice.
16*5f2eab64SJohn Marino
17*5f2eab64SJohn MarinoFeatures
18*5f2eab64SJohn Marino
19*5f2eab64SJohn Marino   TRE is not just yet another regexp matcher. TRE has some features
20*5f2eab64SJohn Marino   which are not there in most free POSIX compatible
21*5f2eab64SJohn Marino   implementations. Most of these features are not present in non-free
22*5f2eab64SJohn Marino   implementations either, for that matter.
23*5f2eab64SJohn Marino
24*5f2eab64SJohn MarinoApproximate matching
25*5f2eab64SJohn Marino
26*5f2eab64SJohn Marino   Approximate pattern matching allows matches to be approximate, that
27*5f2eab64SJohn Marino   is, allows the matches to be close to the searched pattern under
28*5f2eab64SJohn Marino   some measure of closeness. TRE uses the edit-distance measure (also
29*5f2eab64SJohn Marino   known as the Levenshtein distance) where characters can be
30*5f2eab64SJohn Marino   inserted, deleted, or substituted in the searched text in order to
31*5f2eab64SJohn Marino   get an exact match. Each insertion, deletion, or substitution adds
32*5f2eab64SJohn Marino   the distance, or cost, of the match. TRE can report the matches
33*5f2eab64SJohn Marino   which have a cost lower than some given threshold value. TRE can
34*5f2eab64SJohn Marino   also be used to search for matches with the lowest cost.
35*5f2eab64SJohn Marino
36*5f2eab64SJohn Marino   TRE includes a version of the agrep (approximate grep) command line
37*5f2eab64SJohn Marino   tool for approximate regexp matching in the style of grep. Unlike
38*5f2eab64SJohn Marino   other agrep implementations (like the one by Sun Wu and Udi Manber
39*5f2eab64SJohn Marino   from University of Arizona available here) TRE agrep allows full
40*5f2eab64SJohn Marino   regexps of any length, any number of errors, and non-uniform costs
41*5f2eab64SJohn Marino   for insertion, deletion and substitution.
42*5f2eab64SJohn Marino
43*5f2eab64SJohn MarinoStrict standard conformance
44*5f2eab64SJohn Marino
45*5f2eab64SJohn Marino   POSIX defines the behaviour of regexp functions precisely. TRE
46*5f2eab64SJohn Marino   attempts to conform to these specifications as strictly as
47*5f2eab64SJohn Marino   possible. TRE always returns the correct matches for subpatterns,
48*5f2eab64SJohn Marino   for example. Very few other implementations do this correctly. In
49*5f2eab64SJohn Marino   fact, the only other implementations besides TRE that I am aware of
50*5f2eab64SJohn Marino   (free or not) that get it right are Rx by Tom Lord, Regex++ by John
51*5f2eab64SJohn Marino   Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.
52*5f2eab64SJohn Marino
53*5f2eab64SJohn Marino   The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
54*5f2eab64SJohn Marino   or Open Group Base Specifications Issue 6, commonly referred to as
55*5f2eab64SJohn Marino   "POSIX".  It can be found online here. The relevant parts are the
56*5f2eab64SJohn Marino   base specifications on regular expressions (and the rationale) and
57*5f2eab64SJohn Marino   the description of the regcomp() API.
58*5f2eab64SJohn Marino
59*5f2eab64SJohn Marino   For an excellent survey on POSIX regexp matchers, see the testregex
60*5f2eab64SJohn Marino   pages by Glenn Fowler of AT&T Labs Research.
61*5f2eab64SJohn Marino
62*5f2eab64SJohn MarinoPredictable matching speed
63*5f2eab64SJohn Marino
64*5f2eab64SJohn Marino   Because of the matching algorithm used in TRE, the maximum time
65*5f2eab64SJohn Marino   consumed by any regexec() call is always directly proportional to
66*5f2eab64SJohn Marino   the length of the searched string. There is one exception: if back
67*5f2eab64SJohn Marino   references are used, the matching may take time that grows
68*5f2eab64SJohn Marino   exponentially with the length of the string. This is because
69*5f2eab64SJohn Marino   matching back references is an NP complete problem, and almost
70*5f2eab64SJohn Marino   certainly requires exponential time to match in the worst case.
71*5f2eab64SJohn Marino
72*5f2eab64SJohn MarinoPredictable and modest memory consumption
73*5f2eab64SJohn Marino
74*5f2eab64SJohn Marino   A regexec() call never allocates memory from the heap. TRE
75*5f2eab64SJohn Marino   allocates all the memory it needs during a regcomp() call, and some
76*5f2eab64SJohn Marino   temporary working space from the stack frame for the duration of
77*5f2eab64SJohn Marino   the regexec() call. The amount of temporary space needed is
78*5f2eab64SJohn Marino   constant during matching and does not depend on the searched
79*5f2eab64SJohn Marino   string. For regexps of reasonable size TRE needs less than 50K of
80*5f2eab64SJohn Marino   dynamically allocated memory during the regcomp() call, less than
81*5f2eab64SJohn Marino   20K for the compiled pattern buffer, and less than two kilobytes of
82*5f2eab64SJohn Marino   temporary working space from the stack frame during a regexec()
83*5f2eab64SJohn Marino   call. There is no time/memory tradeoff. TRE is also small in code
84*5f2eab64SJohn Marino   size; statically linking with TRE increases the executable size
85*5f2eab64SJohn Marino   less than 30K (gcc-3.2, x86, GNU/Linux).
86*5f2eab64SJohn Marino
87*5f2eab64SJohn MarinoWide character and multibyte character set support
88*5f2eab64SJohn Marino
89*5f2eab64SJohn Marino   TRE supports multibyte character sets. This makes it possible to
90*5f2eab64SJohn Marino   use regexps seamlessly with, for example, Japanese locales. TRE
91*5f2eab64SJohn Marino   also provides a wide character API.
92*5f2eab64SJohn Marino
93*5f2eab64SJohn MarinoBinary pattern and data support
94*5f2eab64SJohn Marino
95*5f2eab64SJohn Marino   TRE provides APIs which allow binary zero characters both in
96*5f2eab64SJohn Marino   regexps and searched strings. The standard API cannot be easily
97*5f2eab64SJohn Marino   used to, for example, search for printable words from binary data
98*5f2eab64SJohn Marino   (although it is possible with some hacking). Searching for patterns
99*5f2eab64SJohn Marino   which contain binary zeroes embedded is not possible at all with
100*5f2eab64SJohn Marino   the standard API.
101*5f2eab64SJohn Marino
102*5f2eab64SJohn MarinoCompletely thread safe
103*5f2eab64SJohn Marino
104*5f2eab64SJohn Marino   TRE is completely thread safe. All the exported functions are
105*5f2eab64SJohn Marino   re-entrant, and a single compiled regexp object can be used
106*5f2eab64SJohn Marino   simultaneously in multiple contexts; e.g. in main() and a signal
107*5f2eab64SJohn Marino   handler, or in many threads of a multithreaded application.
108*5f2eab64SJohn Marino
109*5f2eab64SJohn MarinoPortable
110*5f2eab64SJohn Marino
111*5f2eab64SJohn Marino   TRE is portable across multiple platforms. Here's a table of
112*5f2eab64SJohn Marino   platforms and compilers that have been successfully used to compile
113*5f2eab64SJohn Marino   and run TRE:
114*5f2eab64SJohn Marino
115*5f2eab64SJohn Marino      Platform(s)                       | Compiler(s)
116*5f2eab64SJohn Marino      ----------------------------------+------------
117*5f2eab64SJohn Marino      AIX 4.3.2 - 5.3.0                 | GCC, C for AIX compiler version 5
118*5f2eab64SJohn Marino      Compaq Tru64 UNIX V5.1A/B         | Compaq C V6.4-014 - V6.5-011
119*5f2eab64SJohn Marino      Cygwin 1.3 - 1.5                  | GCC
120*5f2eab64SJohn Marino      Digital UNIX V4.0                 | DEC C V5.9-005
121*5f2eab64SJohn Marino      FreeBSD 4 and above               | GCC
122*5f2eab64SJohn Marino      GNU/Linux systems on x86, x86_64, | GCC
123*5f2eab64SJohn Marino      ppc64, s390			|
124*5f2eab64SJohn Marino      HP-UX 10.20- 11.00                | GCC, HP C Compiler
125*5f2eab64SJohn Marino      IRIX 6.5                          | GCC, MIPSpro Compilers 7.3.1.3m
126*5f2eab64SJohn Marino      Max OS X				|
127*5f2eab64SJohn Marino      NetBSD 1.5 and above              | GCC, egcs
128*5f2eab64SJohn Marino      OpenBSD 3.3 and above             | GCC
129*5f2eab64SJohn Marino      Solaris 2.7-10 sparc/x86          | GCC, Sun Workshop 6 compilers
130*5f2eab64SJohn Marino      Windows 98 - XP                   | Microsoft Visual C++ 6.0
131*5f2eab64SJohn Marino
132*5f2eab64SJohn Marino   TRE 0.7.5 should compile without changes on all of the above
133*5f2eab64SJohn Marino   platforms.  Tell me if you are using TRE on a platform that is not
134*5f2eab64SJohn Marino   listed above, and I'll add it to the list. Also let me know if TRE
135*5f2eab64SJohn Marino   does not work on a listed platform.
136*5f2eab64SJohn Marino
137*5f2eab64SJohn Marino   Depending on the platform, you may need to install libutf8 to get
138*5f2eab64SJohn Marino   wide character and multibyte character set support.
139*5f2eab64SJohn Marino
140*5f2eab64SJohn Marino Free
141*5f2eab64SJohn Marino
142*5f2eab64SJohn Marino   TRE is released under a license which is essentially the same as
143*5f2eab64SJohn Marino   the "2 clause" BSD-style license used in NetBSD.  See the file
144*5f2eab64SJohn Marino   LICENSE for details.
145*5f2eab64SJohn Marino
146*5f2eab64SJohn MarinoRoadmap
147*5f2eab64SJohn Marino
148*5f2eab64SJohn Marino   There are currently two features, both related to collating
149*5f2eab64SJohn Marino   elements, missing from 100% POSIX compliance. These are:
150*5f2eab64SJohn Marino
151*5f2eab64SJohn Marino     * Support for collating elements (e.g. [[.<X>.]], where <X> is a
152*5f2eab64SJohn Marino       collating element). It is not possible to support
153*5f2eab64SJohn Marino       multi-character collating elements portably, since POSIX does
154*5f2eab64SJohn Marino       not define a way to determine whether a character sequence is a
155*5f2eab64SJohn Marino       multi-character collating element or not.
156*5f2eab64SJohn Marino
157*5f2eab64SJohn Marino     * Support for equivalence classes, for example [[=<X>=]], where
158*5f2eab64SJohn Marino       <X> is a collating element. An equivalence class matches any
159*5f2eab64SJohn Marino       character which has the same primary collation weight as
160*5f2eab64SJohn Marino       <X>. Again, POSIX provides no portable mechanism for
161*5f2eab64SJohn Marino       determining the primary collation weight of a collating
162*5f2eab64SJohn Marino       element.
163*5f2eab64SJohn Marino
164*5f2eab64SJohn Marino   Note that other portable regexp implementations don't support
165*5f2eab64SJohn Marino   collating elements either. The single exception is Regex++, which
166*5f2eab64SJohn Marino   comes with its own database for collating elements for different
167*5f2eab64SJohn Marino   locales. Support for collating elements and equivalence classes has
168*5f2eab64SJohn Marino   not been widely requested and is not very high on the TODO list at
169*5f2eab64SJohn Marino   the moment.
170*5f2eab64SJohn Marino
171*5f2eab64SJohn Marino   These are other features I'm planning to implement real soon now:
172*5f2eab64SJohn Marino
173*5f2eab64SJohn Marino     * All the missing GNU extensions enabled in GNU regex, such as
174*5f2eab64SJohn Marino       [[:<:]] and [[:>:]]
175*5f2eab64SJohn Marino
176*5f2eab64SJohn Marino     * A REG_SHORTEST regexec() flag for returning the shortest match
177*5f2eab64SJohn Marino       instead of the longest match.
178*5f2eab64SJohn Marino
179*5f2eab64SJohn Marino     * Perl-compatible syntax
180*5f2eab64SJohn Marino
181*5f2eab64SJohn Marino            [:^class:]
182*5f2eab64SJohn Marino               Matches anything but the characters in class. Note that
183*5f2eab64SJohn Marino               [^[:class:]] works already, this would be just a
184*5f2eab64SJohn Marino               convenience shorthand.
185*5f2eab64SJohn Marino
186*5f2eab64SJohn Marino            \A
187*5f2eab64SJohn Marino               Match only at beginning of string
188*5f2eab64SJohn Marino
189*5f2eab64SJohn Marino            \Z
190*5f2eab64SJohn Marino               Match only at end of string, or before newline at the end
191*5f2eab64SJohn Marino
192*5f2eab64SJohn Marino            \z
193*5f2eab64SJohn Marino               Match only at end of string
194*5f2eab64SJohn Marino
195*5f2eab64SJohn Marino            \l
196*5f2eab64SJohn Marino               Lowercase next char (think vi)
197*5f2eab64SJohn Marino
198*5f2eab64SJohn Marino            \u
199*5f2eab64SJohn Marino               Uppercase next char (think vi)
200*5f2eab64SJohn Marino
201*5f2eab64SJohn Marino            \L
202*5f2eab64SJohn Marino               Lowercase till \E (think vi)
203*5f2eab64SJohn Marino
204*5f2eab64SJohn Marino            \U
205*5f2eab64SJohn Marino               Uppercase till \E (think vi)
206*5f2eab64SJohn Marino
207*5f2eab64SJohn Marino            (?=pattern)
208*5f2eab64SJohn Marino               Zero-width positive look-ahead assertions.
209*5f2eab64SJohn Marino
210*5f2eab64SJohn Marino            (?!pattern)
211*5f2eab64SJohn Marino               Zero-width negative look-ahead assertions.
212*5f2eab64SJohn Marino
213*5f2eab64SJohn Marino            (?<=pattern)
214*5f2eab64SJohn Marino               Zero-width positive look-behind assertions.
215*5f2eab64SJohn Marino
216*5f2eab64SJohn Marino            (?<!pattern)
217*5f2eab64SJohn Marino               Zero-width negative look-behind assertions.
218*5f2eab64SJohn Marino
219*5f2eab64SJohn Marino   Documentation especially for the nonstandard features of TRE, such
220*5f2eab64SJohn Marino   as approximate matching, is a work in progress (with "progress"
221*5f2eab64SJohn Marino   loosely defined...)
222*5f2eab64SJohn Marino
223*5f2eab64SJohn Marino   Mailing lists
224*5f2eab64SJohn Marino
225*5f2eab64SJohn Marino   tre-general@lists.laurikari.net
226*5f2eab64SJohn Marino      This list is for any discussion on the TRE software, including
227*5f2eab64SJohn Marino      reporting bugs, feature requests, requests for help, and other
228*5f2eab64SJohn Marino      things.
229*5f2eab64SJohn Marino
230*5f2eab64SJohn Marino   tre-announce@lists.laurikari.net
231*5f2eab64SJohn Marino      Subscribe to this list to get announcements of new releases of
232*5f2eab64SJohn Marino      TRE.  Alternatively, you can subscribe to the freshmeat.net
233*5f2eab64SJohn Marino      project and get similar announcements that way, if you prefer.
234*5f2eab64SJohn Marino
235*5f2eab64SJohn MarinoVille Laurikari    <vl@iki.fi>
236