NLP Versus IR Approaches to Fuzzy Name Searching in Digital Libraries

Name search is an important function in digital library systems and in other information retrieval systems, such as directory search systems, electronic phonebooks and yellow pages. This paper discusses two main approaches to fuzzy name matching, the natural language processing (NLP) approach and the information retrieval (IR) approach, and proposes a hybrid of the two. If person names are considered a (sub-)language, a name search system can be developed with NLP apparatus, including a dictionary, a thesaurus and grammatical schemata. If names are instead perceived as (free) text, an entirely different system may be built that incorporates indexing, retrieval, relevance ranking and other IR techniques. The two schools of thought, NLP and IR, have somewhat different sets of techniques, which originate from different theoretical concerns and research traditions. A selective combination of their complementary features is likely to be more effective for fuzzy name matching than either approach alone. Two principles, position attribute identity (PAI) and position transition likelihood (PTL), are proposed to incorporate aspects of both approaches. The two principles have been implemented in a hybrid NLP and IR model system called Friendly Name Search (FNS) for real-world applications in multilingual directory searches on the Singapore Yellowpages website.
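To make the IR view of names as free text more concrete, the following is a minimal sketch of fuzzy name search by character-bigram indexing and relevance ranking. It is not the FNS system and does not implement PAI or PTL, which the abstract names but does not define. The bigram representation, the Dice coefficient, the 0.5 threshold, the function names and the sample names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def bigrams(name):
    """Return the set of character bigrams of a name, with boundary markers."""
    padded = f"^{name.lower().strip()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(names):
    """Indexing step: map each bigram to the ids of the names that contain it."""
    index = defaultdict(set)
    for name_id, name in enumerate(names):
        for gram in bigrams(name):
            index[gram].add(name_id)
    return index


def search(query, names, index, threshold=0.5):
    """Retrieve names sharing a bigram with the query, ranked by Dice similarity."""
    query_grams = bigrams(query)

    # Retrieval step: any name that shares at least one bigram is a candidate.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())

    # Ranking step: score each candidate and keep those above the threshold.
    scored = []
    for name_id in candidates:
        name_grams = bigrams(names[name_id])
        dice = 2 * len(query_grams & name_grams) / (len(query_grams) + len(name_grams))
        if dice >= threshold:
            scored.append((dice, names[name_id]))

    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, round(score, 2)) for score, name in scored]


# Hypothetical directory entries, for illustration only.
names = ["Tan Ah Kow", "Tan Ah Kau", "Lim Swee Hock"]
index = build_index(names)
print(search("Tan Ah Kao", names, index))
```

A sketch of this kind treats a name purely as a string: it knows nothing about name order, romanization variants or the structure of the name, which is the sort of knowledge the NLP approach would supply through a dictionary or thesaurus.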
