Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning

Building on more than one million crowdsourced annotations, which we publicly release, we propose a new automated author disambiguation solution that exploits these data (i) to learn an accurate classifier for identifying coreferring authors and (ii) to guide the clustering of scientific publications by distinct authors in a semi-supervised way. To the best of our knowledge, our analysis is the first to be carried out on data of this size and coverage. With respect to the state of the art, we validate the general pipeline used in most existing solutions and improve on it by (i) proposing new phonetic-based blocking strategies, thereby increasing recall; (ii) adding strong ethnicity-sensitive features for learning a linkage function, thereby tailoring disambiguation to non-Western author names whenever necessary; and (iii) showing the importance of balancing negative and positive examples when learning the linkage function.
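The idea behind phonetic blocking can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the abstract does not say which phonetic algorithm is used, so plain American Soundex serves here as a stand-in, and the names `soundex` and `block_by_surname` as well as the "Surname, Given" signature format are assumptions made for illustration only.

```python
from collections import defaultdict

# Soundex digit for each consonant class (American Soundex).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code of a name.

    Assumption: non-ASCII letters are simply dropped, which is a
    limitation of this sketch rather than a recommended practice.
    """
    letters = [c for c in name.lower() if c.isalpha() and c.isascii()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")  # vowels and y map to ""
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]


def block_by_surname(signatures):
    """Group hypothetical 'Surname, Given' strings by surname Soundex code."""
    blocks = defaultdict(list)
    for sig in signatures:
        surname = sig.split(",")[0]
        blocks[soundex(surname)].append(sig)
    return dict(blocks)


if __name__ == "__main__":
    sigs = ["Meyer, A.", "Meier, Anna", "Smith, J.", "Smyth, John"]
    for key, members in block_by_surname(sigs).items():
        print(key, members)
```

Spelling variants such as "Meyer" and "Meier" fall into the same block, so a later pairwise classifier still gets to compare them. Blocking on the exact surname string would separate them and lose recall.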
