Hybrid Matching Algorithm for Personal Names

Companies acquire personal information by phone, over the World Wide Web, or by email in order to sell or advertise their products. However, when this information is acquired, moved, copied, or edited, its quality may degrade. Relying on data administrators, or on a tool with limited capabilities, to correct mistyped information can itself cause many problems. Moreover, most correction techniques are designed for the words used in everyday language. Because personal names have different characteristics from general text, a hybrid matching algorithm (PNRS) was developed that employs phonetic encoding, string matching, and statistical facts to provide possible candidates for misspelled names. Finally, the efficiency of the proposed algorithm is compared with that of other well-known spelling correction techniques.
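To make the idea of a hybrid matcher concrete, the following is a minimal sketch and not the PNRS algorithm itself, whose details are not given here. It assumes Soundex as the phonetic encoding, Levenshtein distance as the string-matching step, and name frequency counts as the statistical evidence. The function names, the `max_distance` threshold, and the sample counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Standard American Soundex digit groups (vowels, H, W, Y have no digit).
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def suggest(misspelled, name_counts, max_distance=2):
    """Rank known names as candidate corrections for a misspelled name.

    A name is a candidate if it shares the query's Soundex code or lies
    within max_distance edits. Candidates are ordered by edit distance,
    with ties broken in favour of the more frequent name.
    """
    query = misspelled.lower()
    key = soundex(query)
    scored = []
    for name, count in name_counts.items():
        dist = levenshtein(query, name.lower())
        if soundex(name) == key or dist <= max_distance:
            scored.append((dist, -count, name))
    return [name for _, _, name in sorted(scored)]


# Hypothetical frequency table of known surnames (illustrative values only).
known_names = Counter({"Smith": 500, "Smyth": 20, "Schmidt": 80, "Jones": 300})
candidates = suggest("Smiht", known_names)
```

In this sketch the phonetic code widens the candidate set to names that sound alike, the edit distance orders candidates by closeness of spelling, and the frequency count acts only as a tie-breaker. A real system would have to decide how to weight these three signals against one another.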
