A "Roziah" by any other name: a simple Bayesian method for determining ethnicity from names.

Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification from names typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011 to 2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were split into contiguous 3-letter substrings, and these were used as the features for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. Classification performance differed little between the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
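The approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: names are lowercased and stripped of spaces before the 3-letter substrings are extracted, add-one (Laplace) smoothing handles unseen substrings, and the class name `TrigramNaiveBayes`, the helper names, and the six toy training names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Contiguous 3-letter substrings of a name.

    Assumption: letters are lowercased and non-letters (spaces) are dropped.
    """
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return [letters[i:i + 3] for i in range(len(letters) - 2)]


class TrigramNaiveBayes:
    """Multinomial naive Bayes over name trigrams (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; an assumption, not from the source
        self.class_counts = Counter()
        self.gram_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for gram in trigrams(name):
                self.gram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def log_scores(self, name):
        """Unnormalised log posterior for each ethnicity label."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)  # log prior
            for gram in trigrams(name):
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


def cohen_kappa(truth, predicted):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    t_counts, p_counts = Counter(truth), Counter(predicted)
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data; the study used 10,104 training names.
train_names = ["Roziah binti Ahmad", "Siti Nurhaliza", "Tan Wei Ling",
               "Lim Chee Keong", "Rajesh Kumar", "Priya Subramaniam"]
train_labels = ["Malay", "Malay", "Chinese", "Chinese", "Indian", "Indian"]

model = TrigramNaiveBayes().fit(train_names, train_labels)
for unseen in ["Nur Aziah", "Tan Mei Ling", "Suresh Kumar"]:
    print(unseen, "->", model.predict(unseen))
```

Working in log space avoids numerical underflow when a name yields many substrings. Sensitivity and specificity per group could be computed from the same predicted and true labels by treating each ethnicity in turn as the positive class.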
