Authormagic: an approach to author disambiguation in large-scale digital libraries

A collaboration of leading research centers in the field of High Energy Physics (HEP) has built INSPIRE, a novel information infrastructure that comprises the entire corpus of about one million documents produced within the discipline, including a rich set of metadata, citation information, and half a million full-text documents. This corpus offers a unique opportunity for author disambiguation strategies. The presented approach features extended metadata comparison metrics and a three-step unsupervised graph clustering technique. The algorithm aided in identifying 200,000 individuals from 6,500,000 author signatures. Preliminary tests, based on the knowledge of external experts and a pilot of a crowd-sourcing system, show a success rate of more than 96% within the selected test cases. The resulting author clusters serve as recommendations that INSPIRE users can refine further, cleaning publication lists in a crowd-sourced approach.
