Avoiding bias when inferring race using name-based approaches

Racial disparity in academia is a widely acknowledged problem. A quantitative understanding of race-based systemic inequalities is an important step towards a more equitable research system. However, few large-scale analyses have been performed on this topic, mostly because of the lack of robust race-disambiguation algorithms. Author-identifying information generally does not include the author's race. Therefore, an algorithm must be employed that uses known information about authors, i.e., their names, to infer their perceived race. Nevertheless, like any other algorithm, the process of racial inference can generate biases if it is not carefully considered. When the research is focused on understanding race-based inequalities, such biases undermine the objectives of the investigation and may perpetuate inequities. The goal of this article is to assess the biases introduced by the different approaches used in name-based racial inference. We use information from the US Census and mortgage applications to infer the race of US author names in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race and ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article fills an important research gap and will allow more systematic and unbiased studies of racial disparity in science.
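
The contrast between threshold and continuous (fractional) assignment can be illustrated with a minimal sketch. This is not the authors' implementation: the probability table, the placeholder names, the group labels, and the 0.9 cutoff are all hypothetical values chosen only to show the mechanism. A threshold rule assigns a name to a single group only when one probability exceeds the cutoff, whereas fractional counting spreads each name across groups in proportion to its probabilities.

```python
from collections import defaultdict

# Hypothetical name -> race probability table.
# Illustrative values only; these are NOT census or mortgage data.
NAME_PROBS = {
    "name_a": {"White": 0.70, "Black": 0.25, "Other": 0.05},
    "name_b": {"White": 0.55, "Black": 0.40, "Other": 0.05},
    "name_c": {"White": 0.10, "Black": 0.85, "Other": 0.05},
    "name_d": {"White": 0.95, "Black": 0.03, "Other": 0.02},
}


def threshold_counts(names, probs, threshold=0.9):
    """Assign each name to its most likely group only if that
    probability reaches the threshold; otherwise leave it unassigned."""
    counts = defaultdict(int)
    for name in names:
        group, p = max(probs[name].items(), key=lambda kv: kv[1])
        counts[group if p >= threshold else "Unassigned"] += 1
    return dict(counts)


def fractional_counts(names, probs):
    """Distribute each name across all groups in proportion to its
    probabilities (the continuous-distribution approach)."""
    counts = defaultdict(float)
    for name in names:
        for group, p in probs[name].items():
            counts[group] += p
    return dict(counts)


authors = ["name_a", "name_b", "name_c", "name_d"]
print("threshold :", threshold_counts(authors, NAME_PROBS))
print("fractional:", fractional_counts(authors, NAME_PROBS))
```

In this toy table, only the name with a very high White probability clears the cutoff, so the thresholded sample contains no Black authors at all, while fractional counting retains their expected share. This is one way the underestimation described above can arise when names associated with one group are less distinctive than names associated with another.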
