Detecting Ambiguous Author Names in Crowdsourced Scholarly Data

Name ambiguity is a challenge in many areas, especially in bibliographic digital libraries. For example, in services that use citation data to compute the impact of authors, ambiguous names lead to biased measures. The problem is amplified when names are collected from heterogeneous sources, including crowdsourced annotations. This is the case in the Scholarometer system, which cross-correlates author names in user queries with those retrieved from bibliographic data. The uncontrolled nature of user-generated annotations is very valuable, but it creates the need to detect ambiguous names. In this paper, we propose an approach that detects ambiguous names at query time, which makes it applicable in the context of a social computing application. We explore two kinds of heuristic features, based on citations and on crowdsourced topics. Our approach detects ambiguous author names in crowdsourced scholarly data with an accuracy of 75%.
