Person name disambiguation is essential for distinguishing between people who share the same name when unique identifiers are not available. The problem is common in many domains, including digital libraries, where the same name can refer to multiple distinct authors. Correctly attributing work and citations requires the digital library's database to be disambiguated. In this work we describe a large-scale framework for disambiguating author names efficiently and effectively. The framework uses a density-based clustering algorithm with a random-forest-based distance function to cluster unique authors. Effective use of blocking functions allows the clustering algorithm to be run in parallel. In our experiments we show that the framework disambiguates the authors of more than 4 million papers in 24 hours.
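The pipeline described above (blocking, then density-based clustering within each block, with blocks processed independently) can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the record fields, the `block_key` rule (last name plus first initial), the `eps` and `min_pts` values, and all function names are hypothetical, and the learned random forest distance is replaced by a simple Jaccard distance over coauthor names as a stand-in.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def block_key(record):
    """Hypothetical blocking function: last name plus first initial."""
    return f"{record['last'].lower()}_{record['first'].lower()[:1]}"


def distance(a, b):
    """Stand-in distance: Jaccard distance over coauthor sets.
    The framework uses a random forest here; this is only a placeholder."""
    sa, sb = set(a["coauthors"]), set(b["coauthors"])
    if not sa and not sb:
        return 1.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def dbscan(records, eps, min_pts):
    """Plain DBSCAN with a pluggable distance; returns one label per record (-1 = noise)."""
    n = len(records)
    labels = [None] * n

    def neighbors(i):
        # A point always counts as its own neighbor.
        return [j for j in range(n)
                if j == i or distance(records[i], records[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point, keep expanding
    return labels


def disambiguate(records, eps=0.7, min_pts=1):
    """Group records into blocks, cluster each block independently, merge the results."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    def cluster_block(item):
        key, recs = item
        labels = dbscan(recs, eps, min_pts)
        return [(r["id"], f"{key}#{label}") for r, label in zip(recs, labels)]

    # Blocks share no state, so they can be handed to any worker pool.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(cluster_block, blocks.items()))
    return dict(pair for block in results for pair in block)


records = [
    {"id": 1, "first": "J", "last": "Smith", "coauthors": ["A", "B"]},
    {"id": 2, "first": "John", "last": "Smith", "coauthors": ["B", "C"]},
    {"id": 3, "first": "Jane", "last": "Smith", "coauthors": ["X", "Y"]},
    {"id": 4, "first": "Wei", "last": "Wang", "coauthors": ["X", "Y"]},
]
authors = disambiguate(records)
```

Because records are only ever compared within a block, the quadratic neighbor search is confined to each block rather than to the whole collection. The thread pool here only illustrates that blocks are independent; in CPython a process pool or separate jobs per block would be needed for real CPU parallelism.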