Comparison of Modified Spanish Phonetic, Soundex, and Phonex coding functions during the data matching process

This paper aims to help native Spanish speakers identify an open and effective Spanish encoding function for the data matching process. We present an implementation and enhancement of the Spanish Phonetic Soundex encoding algorithm [1]. We evaluate data matching with Spanish Phonetic Soundex, Soundex [2], [3], and Phonex [4] in terms of precision, recall, and F-measure. As far as we know, no such comparison of these phonetic functions has been presented before. For Spanish-speaking users, we propose a Modified Spanish Phonetic Soundex function that performs better in terms of precision, F-measure, and the similarity values derived from the encoding phase than the phonetic coding functions commonly used until now.
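To make the evaluation setup concrete, the following is a minimal sketch of how a phonetic code can drive matching and how precision, recall, and F-measure are computed from the resulting pairs. It uses the standard American Soundex rules as a stand-in, not the Modified Spanish Phonetic Soundex proposed in the paper, and the names `soundex`, `matched_pairs`, and `precision_recall_f`, as well as the sample names and ground-truth pairs, are illustrative assumptions.

```python
from itertools import combinations

# Standard American Soundex digit groups (an assumption for illustration;
# this is NOT the Spanish variant evaluated in the paper).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if digit and digit != prev:
            code += digit
        # H and W do not break a run of equal codes; other uncoded letters
        # (vowels, Y, and any letter outside the table such as accented
        # characters) do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def matched_pairs(names):
    """Index pairs whose names share the same phonetic code."""
    return {(a, b) for a, b in combinations(range(len(names)), 2)
            if soundex(names[a]) == soundex(names[b])}


def precision_recall_f(predicted, actual):
    """Precision, recall, and F-measure of predicted pairs against true pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical sample data: the "true" matches are invented for the example.
names = ["Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft"]
true_matches = {(0, 2), (3, 4)}
print(precision_recall_f(matched_pairs(names), true_matches))
```

In this toy data the code pairs Robert with Rupert (a false positive) and misses Robert with Rubin (a false negative), which is the kind of trade-off that precision and recall quantify when phonetic functions are compared.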

[1] Fabio Stella et al. A Privacy Preserving Framework for Accuracy and Completeness Quality Assessment, 2009.

[2] Peter Christen et al. Data Matching, 2012, Data-Centric Systems and Applications.

[3] Peter Christen et al. Preparation of name and address data for record linkage using hidden Markov models, 2002, BMC Medical Informatics and Decision Making.

[4] Peter Christen et al. Quality and Complexity Measures for Data Linkage and Deduplication, 2007, Quality Measures in Data Mining.

[5] Christine L. Borgman et al. Getty's Synoname™ and its cousins: A survey of applications of personal name-matching algorithms, 1992.

[6] F. Moreno et al. Algoritmo fonético para detección de cadenas de texto duplicadas en el idioma español, 2012.

[7] Peter Christen et al. Febrl: a freely available record linkage system with a graphical user interface, 2008.

[8] Brian Randell et al. An Assessment of Name Matching Algorithms, 1996.

[9] Lawrence Philips et al. The double metaphone search algorithm, 2000.

[10] Christine L. Borgman et al. Getty's Synoname and Its Cousins: A Survey of Applications of Personal Name-Matching Algorithms, 1992, J. Am. Soc. Inf. Sci.

[11] William E. Winkler et al. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, 1990.

[12] Matthew A. Jaro et al. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, 1989.

[13] Ian H. Witten et al. Managing gigabytes 2nd edition, 1999.

[14] David O. Holmes et al. Improving precision and recall for Soundex retrieval, 2002, Proceedings. International Conference on Information Technology: Coding and Computing.