Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching

Personal names are important and common in many data sources, ranging from social networks and news articles to patient records and scientific documents. They are often used as queries for retrieving records and as key information for linking documents from multiple sources. Matching personal names can be challenging because of variations in spelling and formatting. While many approximate name-matching techniques have been proposed, most are generic string-matching algorithms. Unlike other types of proper names, personal names are highly cultural: many ethnicities have their own naming systems and identifiable characteristics. In this paper we explore these relationships between ethnicities and personal names to improve name-matching performance. First, we propose a name-ethnicity classifier based on multinomial logistic regression. Our model identifies name-ethnicity from personal names in Wikipedia, which we use to define name-ethnicity, with 85% accuracy. Surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification. Next, we propose a novel alignment-based name-matching algorithm built on the Smith-Waterman algorithm and logistic regression. Different name-matching models are then trained for different name-ethnicity groups. Our preliminary experiment on DBLP's disambiguated author dataset yields 99% precision and 89% recall.
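To make the alignment step concrete, the following is a minimal sketch of Smith-Waterman local alignment applied to two name strings. It is not the matching algorithm proposed in the paper: the scoring values (`match`, `mismatch`, `gap`), the lowercasing step, and the helper `normalized_similarity` are assumptions chosen for illustration, and the logistic regression that would combine alignment-derived features is not shown.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between two strings.

    The scoring parameters are illustrative assumptions, not values
    taken from the paper.
    """
    a, b = a.lower(), b.lower()
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_similarity(a: str, b: str, match: int = 2) -> float:
    """Hypothetical helper: scale the score by the best score the
    shorter string could achieve, giving a value in [0, 1]."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman(a, b, match=match) / (match * shorter)


if __name__ == "__main__":
    for pair in [("Jon", "John"), ("Wei Wang", "W. Wang"), ("Kim", "Garcia")]:
        print(pair, round(normalized_similarity(*pair), 2))
```

A score of this kind could serve as one input feature to a per-group classifier, with a separate model trained for each name-ethnicity group as the abstract describes; how the features are actually constructed is not specified here.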