Bibliographic Entity Automatic Recognition and Disambiguation

This master's thesis reports on an applied machine learning research internship carried out at the digital library of the European Organization for Nuclear Research (CERN). The way an author's name varies in its representation across scientific publications creates ambiguity when it comes to uniquely identifying an author. In the database of any scientific digital library, the same full-name variation can be used by more than one author, and this may occur even between authors from the same research affiliation. In this work, we built a machine-learning-based author name disambiguation solution. The approach consists of learning a distance function from ground-truth data, blocking publications with broadly similar author names, and clustering the publications within each block using a semi-supervised strategy. The main contributions of this work are twofold. The first is an improvement of the distance model that takes into account the (estimated) ethnicity of the author's full name. Indeed, names from different ethnicities, for example Asian versus Arabic names, should be processed differently. This added feature led to better clustering evaluation results and accounted for a high share of importance in the feature-importance analysis. The second main contribution was to decide on a thresholding strategy to form a flat clustering from the agglomerative hierarchical clustering. Six different strategies were evaluated to estimate the number of clusters in each block. The strategy that gave the best evaluation results uses a blocking function that groups signatures sharing a common last name and first-name initial, then applies the semi-supervised clustering to the blocks that contain samples from the ground truth. Each block that does not contain any labeled sample forms a single cluster. A smaller contribution was also made to the distance model, including feature engineering and pair sampling.
Overall, the model accuracy is 98%, compared to 94% if we only disambiguate on the common normalized last name and first-name initial. My work helped raise the accuracy from 97% to slightly more than 98%, that is, the error rate drops from 3% to just under 2%. This is equivalent to reducing the error rate by about 35%. During the project, I also contributed to an open source project that will eventually be deployed in the high-energy physics digital library of CERN (http://inspirehep.net). Many factors contributed to achieving such an accurate disambiguation model. A key factor was having ground-truth data, which allowed us to design a very good semi-supervised clustering. Another factor was learning an accurate distance model with appropriate feature engineering, in which we managed to incorporate external knowledge of the name ethnicity.
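The blocking rule described in the abstract (group signatures by normalized last name and first-name initial, then treat blocks with and without ground-truth samples differently) can be illustrated with a minimal sketch. This is not the thesis implementation: the "Last, First" name convention, the `author_name` and `cluster_id` fields, and all function names are assumptions made only for this example, and the semi-supervised clustering step itself is not shown.

```python
import unicodedata
from collections import defaultdict


def normalize(text):
    """Strip accents and non-letter characters, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if c.isascii() and c.isalpha()).lower()


def block_key(full_name):
    """Build a key from the normalized last name and first-name initial.

    Assumes names are written as 'Last, First' (an assumption of this sketch).
    """
    last, _, first = full_name.partition(",")
    return (normalize(last) + " " + normalize(first)[:1]).strip()


def build_blocks(signatures):
    """Group signatures (hypothetical dicts) by their block key."""
    blocks = defaultdict(list)
    for sig in signatures:
        blocks[block_key(sig["author_name"])].append(sig)
    return dict(blocks)


def split_by_ground_truth(blocks):
    """Separate blocks holding at least one labeled sample from the rest.

    Labeled blocks would go to the semi-supervised clustering; each
    unlabeled block is kept as one single cluster.
    """
    labeled, unlabeled = {}, {}
    for key, sigs in blocks.items():
        has_label = any(s.get("cluster_id") is not None for s in sigs)
        (labeled if has_label else unlabeled)[key] = sigs
    return labeled, unlabeled


# Hypothetical signatures; cluster_id is None when no ground truth exists.
signatures = [
    {"author_name": "Müller, Anna", "cluster_id": 7},
    {"author_name": "Muller, A.", "cluster_id": None},
    {"author_name": "Doe, Jane", "cluster_id": None},
]
blocks = build_blocks(signatures)
labeled, unlabeled = split_by_ground_truth(blocks)
```

In this toy input, both spellings of the first name fall into the same block, which contains a labeled sample and would therefore be clustered further, while the remaining block has no labels and stays as one cluster.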
