1*0Sstevel@tonic-gatepackage Text::Soundex; 2*0Sstevel@tonic-gaterequire 5.000; 3*0Sstevel@tonic-gaterequire Exporter; 4*0Sstevel@tonic-gate 5*0Sstevel@tonic-gate@ISA = qw(Exporter); 6*0Sstevel@tonic-gate@EXPORT = qw(&soundex $soundex_nocode); 7*0Sstevel@tonic-gate 8*0Sstevel@tonic-gate$VERSION = '1.01'; 9*0Sstevel@tonic-gate 10*0Sstevel@tonic-gate# $Id: soundex.pl,v 1.2 1994/03/24 00:30:27 mike Exp $ 11*0Sstevel@tonic-gate# 12*0Sstevel@tonic-gate# Implementation of soundex algorithm as described by Knuth in volume 13*0Sstevel@tonic-gate# 3 of The Art of Computer Programming, with ideas stolen from Ian 14*0Sstevel@tonic-gate# Phillipps <ian@pipex.net>. 15*0Sstevel@tonic-gate# 16*0Sstevel@tonic-gate# Mike Stok <Mike.Stok@meiko.concord.ma.us>, 2 March 1994. 17*0Sstevel@tonic-gate# 18*0Sstevel@tonic-gate# Knuth's test cases are: 19*0Sstevel@tonic-gate# 20*0Sstevel@tonic-gate# Euler, Ellery -> E460 21*0Sstevel@tonic-gate# Gauss, Ghosh -> G200 22*0Sstevel@tonic-gate# Hilbert, Heilbronn -> H416 23*0Sstevel@tonic-gate# Knuth, Kant -> K530 24*0Sstevel@tonic-gate# Lloyd, Ladd -> L300 25*0Sstevel@tonic-gate# Lukasiewicz, Lissajous -> L222 26*0Sstevel@tonic-gate# 27*0Sstevel@tonic-gate# $Log: soundex.pl,v $ 28*0Sstevel@tonic-gate# Revision 1.2 1994/03/24 00:30:27 mike 29*0Sstevel@tonic-gate# Subtle bug (any excuse :-) spotted by Rich Pinder <rpinder@hsc.usc.edu> 30*0Sstevel@tonic-gate# in the way I handles leasing characters which were different but had 31*0Sstevel@tonic-gate# the same soundex code. This showed up comparing it with Oracle's 32*0Sstevel@tonic-gate# soundex output. 33*0Sstevel@tonic-gate# 34*0Sstevel@tonic-gate# Revision 1.1 1994/03/02 13:01:30 mike 35*0Sstevel@tonic-gate# Initial revision 36*0Sstevel@tonic-gate# 37*0Sstevel@tonic-gate# 38*0Sstevel@tonic-gate############################################################################## 39*0Sstevel@tonic-gate 40*0Sstevel@tonic-gate# $soundex_nocode is used to indicate a string doesn't have a soundex 41*0Sstevel@tonic-gate# code, I like undef other people may want to set it to 'Z000'. 42*0Sstevel@tonic-gate 43*0Sstevel@tonic-gate$soundex_nocode = undef; 44*0Sstevel@tonic-gate 45*0Sstevel@tonic-gatesub soundex 46*0Sstevel@tonic-gate{ 47*0Sstevel@tonic-gate local (@s, $f, $fc, $_) = @_; 48*0Sstevel@tonic-gate 49*0Sstevel@tonic-gate push @s, '' unless @s; # handle no args as a single empty string 50*0Sstevel@tonic-gate 51*0Sstevel@tonic-gate foreach (@s) 52*0Sstevel@tonic-gate { 53*0Sstevel@tonic-gate $_ = uc $_; 54*0Sstevel@tonic-gate tr/A-Z//cd; 55*0Sstevel@tonic-gate 56*0Sstevel@tonic-gate if ($_ eq '') 57*0Sstevel@tonic-gate { 58*0Sstevel@tonic-gate $_ = $soundex_nocode; 59*0Sstevel@tonic-gate } 60*0Sstevel@tonic-gate else 61*0Sstevel@tonic-gate { 62*0Sstevel@tonic-gate ($f) = /^(.)/; 63*0Sstevel@tonic-gate tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/; 64*0Sstevel@tonic-gate ($fc) = /^(.)/; 65*0Sstevel@tonic-gate s/^$fc+//; 66*0Sstevel@tonic-gate tr///cs; 67*0Sstevel@tonic-gate tr/0//d; 68*0Sstevel@tonic-gate $_ = $f . $_ . '000'; 69*0Sstevel@tonic-gate s/^(.{4}).*/$1/; 70*0Sstevel@tonic-gate } 71*0Sstevel@tonic-gate } 72*0Sstevel@tonic-gate 73*0Sstevel@tonic-gate wantarray ? @s : shift @s; 74*0Sstevel@tonic-gate} 75*0Sstevel@tonic-gate 76*0Sstevel@tonic-gate1; 77*0Sstevel@tonic-gate 78*0Sstevel@tonic-gate__END__ 79*0Sstevel@tonic-gate 80*0Sstevel@tonic-gate=head1 NAME 81*0Sstevel@tonic-gate 82*0Sstevel@tonic-gateText::Soundex - Implementation of the Soundex Algorithm as Described by Knuth 83*0Sstevel@tonic-gate 84*0Sstevel@tonic-gate=head1 SYNOPSIS 85*0Sstevel@tonic-gate 86*0Sstevel@tonic-gate use Text::Soundex; 87*0Sstevel@tonic-gate 88*0Sstevel@tonic-gate $code = soundex $string; # get soundex code for a string 89*0Sstevel@tonic-gate @codes = soundex @list; # get list of codes for list of strings 90*0Sstevel@tonic-gate 91*0Sstevel@tonic-gate # set value to be returned for strings without soundex code 92*0Sstevel@tonic-gate 93*0Sstevel@tonic-gate $soundex_nocode = 'Z000'; 94*0Sstevel@tonic-gate 95*0Sstevel@tonic-gate=head1 DESCRIPTION 96*0Sstevel@tonic-gate 97*0Sstevel@tonic-gateThis module implements the soundex algorithm as described by Donald Knuth 98*0Sstevel@tonic-gatein Volume 3 of B<The Art of Computer Programming>. The algorithm is 99*0Sstevel@tonic-gateintended to hash words (in particular surnames) into a small space using a 100*0Sstevel@tonic-gatesimple model which approximates the sound of the word when spoken by an English 101*0Sstevel@tonic-gatespeaker. Each word is reduced to a four character string, the first 102*0Sstevel@tonic-gatecharacter being an upper case letter and the remaining three being digits. 103*0Sstevel@tonic-gate 104*0Sstevel@tonic-gateIf there is no soundex code representation for a string then the value of 105*0Sstevel@tonic-gateC<$soundex_nocode> is returned. This is initially set to C<undef>, but 106*0Sstevel@tonic-gatemany people seem to prefer an I<unlikely> value like C<Z000> 107*0Sstevel@tonic-gate(how unlikely this is depends on the data set being dealt with.) Any value 108*0Sstevel@tonic-gatecan be assigned to C<$soundex_nocode>. 109*0Sstevel@tonic-gate 110*0Sstevel@tonic-gateIn scalar context C<soundex> returns the soundex code of its first 111*0Sstevel@tonic-gateargument, and in list context a list is returned in which each element is the 112*0Sstevel@tonic-gatesoundex code for the corresponding argument passed to C<soundex> e.g. 113*0Sstevel@tonic-gate 114*0Sstevel@tonic-gate @codes = soundex qw(Mike Stok); 115*0Sstevel@tonic-gate 116*0Sstevel@tonic-gateleaves C<@codes> containing C<('M200', 'S320')>. 117*0Sstevel@tonic-gate 118*0Sstevel@tonic-gate=head1 EXAMPLES 119*0Sstevel@tonic-gate 120*0Sstevel@tonic-gateKnuth's examples of various names and the soundex codes they map to 121*0Sstevel@tonic-gateare listed below: 122*0Sstevel@tonic-gate 123*0Sstevel@tonic-gate Euler, Ellery -> E460 124*0Sstevel@tonic-gate Gauss, Ghosh -> G200 125*0Sstevel@tonic-gate Hilbert, Heilbronn -> H416 126*0Sstevel@tonic-gate Knuth, Kant -> K530 127*0Sstevel@tonic-gate Lloyd, Ladd -> L300 128*0Sstevel@tonic-gate Lukasiewicz, Lissajous -> L222 129*0Sstevel@tonic-gate 130*0Sstevel@tonic-gateso: 131*0Sstevel@tonic-gate 132*0Sstevel@tonic-gate $code = soundex 'Knuth'; # $code contains 'K530' 133*0Sstevel@tonic-gate @list = soundex qw(Lloyd Gauss); # @list contains 'L300', 'G200' 134*0Sstevel@tonic-gate 135*0Sstevel@tonic-gate=head1 LIMITATIONS 136*0Sstevel@tonic-gate 137*0Sstevel@tonic-gateAs the soundex algorithm was originally used a B<long> time ago in the US 138*0Sstevel@tonic-gateit considers only the English alphabet and pronunciation. 139*0Sstevel@tonic-gate 140*0Sstevel@tonic-gateAs it is mapping a large space (arbitrary length strings) onto a small 141*0Sstevel@tonic-gatespace (single letter plus 3 digits) no inference can be made about the 142*0Sstevel@tonic-gatesimilarity of two strings which end up with the same soundex code. For 143*0Sstevel@tonic-gateexample, both C<Hilbert> and C<Heilbronn> end up with a soundex code 144*0Sstevel@tonic-gateof C<H416>. 145*0Sstevel@tonic-gate 146*0Sstevel@tonic-gate=head1 AUTHOR 147*0Sstevel@tonic-gate 148*0Sstevel@tonic-gateThis code was implemented by Mike Stok (C<stok@cybercom.net>) from the 149*0Sstevel@tonic-gatedescription given by Knuth. Ian Phillipps (C<ian@pipex.net>) and Rich Pinder 150*0Sstevel@tonic-gate(C<rpinder@hsc.usc.edu>) supplied ideas and spotted mistakes. 151