Random Forest DBSCAN Clustering for USPTO Inventor Name Disambiguation and Conflation

Name disambiguation and the subsequent name conflation are essential for correctly processing person-name queries in a digital library or other database. Disambiguation distinguishes the records of each unique person from all other records in the database. We study inventor name disambiguation for a patent database, building on methods and features from earlier work on author name disambiguation, and propose a feature set appropriate for a patent database. A random forest was selected as the pairwise linking classifier because it outperforms Naive Bayes, logistic regression, support vector machines (SVM), conditional inference trees, and decision trees. The blocking size, which is very important for scaling, was selected based on experiments that determined feature importance and accuracy. The DBSCAN algorithm is used to cluster records, with a distance function derived from the random forest classifier. For additional scalability, clustering was parallelized. Tests on the USPTO patent database show that our method successfully disambiguated 12 million inventor mentions within 6.5 hours. Evaluation on datasets from the USPTO PatentsView inventor name disambiguation competition shows that our algorithm outperforms all algorithms in the competition.
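
The pipeline described above (blocking, pairwise link scoring, then density-based clustering on a classifier-derived distance) can be illustrated with a minimal sketch. It is not the authors' implementation. To stay dependency-free, a hand-written similarity score (`link_probability`) stands in for the trained random forest, and the distance is assumed to be one minus that score. The record fields, the blocking key (last name plus first initial), the weights, and the `eps` and `min_pts` values are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Toy inventor mentions; every field and value here is made up for illustration.
MENTIONS = [
    {"id": 0, "first": "John", "last": "Smith", "city": "Austin", "coinventors": {"Lee"}},
    {"id": 1, "first": "John", "last": "Smith", "city": "Austin", "coinventors": {"Lee", "Park"}},
    {"id": 2, "first": "Jon", "last": "Smith", "city": "Boston", "coinventors": {"Garcia"}},
    {"id": 3, "first": "Mary", "last": "Jones", "city": "Denver", "coinventors": {"Chen"}},
    {"id": 4, "first": "Mary", "last": "Jones", "city": "Denver", "coinventors": {"Chen"}},
]


def block_key(mention):
    """Assumed blocking key: last name plus first initial."""
    return (mention["last"].lower(), mention["first"][0].lower())


def link_probability(a, b):
    """Stand-in for the random forest's P(same inventor); weights are arbitrary."""
    name_sim = SequenceMatcher(None, a["first"].lower(), b["first"].lower()).ratio()
    same_city = 1.0 if a["city"] == b["city"] else 0.0
    shared = 1.0 if a["coinventors"] & b["coinventors"] else 0.0
    return 0.4 * name_sim + 0.3 * same_city + 0.3 * shared


def dbscan(n, dist, eps, min_pts):
    """Plain DBSCAN over a precomputed n x n distance matrix; -1 marks noise."""
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels


def disambiguate(mentions, eps=0.5, min_pts=2):
    """Map each mention id to an inventor id, clustering one block at a time."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[block_key(m)].append(m)

    assignment = {}
    next_id = 0
    # Each block is clustered independently of the others.
    for members in blocks.values():
        n = len(members)
        dist = [
            [0.0 if i == j else 1.0 - link_probability(members[i], members[j])
             for j in range(n)]
            for i in range(n)
        ]
        labels = dbscan(n, dist, eps, min_pts)
        local_ids = {}
        for m, label in zip(members, labels):
            if label == -1 or label not in local_ids:
                # Noise points are treated as single-mention inventors (an assumption).
                new_id = next_id
                next_id += 1
                if label != -1:
                    local_ids[label] = new_id
                assignment[m["id"]] = new_id
            else:
                assignment[m["id"]] = local_ids[label]
    return assignment


if __name__ == "__main__":
    for mention_id, inventor_id in sorted(disambiguate(MENTIONS).items()):
        print(mention_id, "->", inventor_id)
```

In this sketch, pairs are only ever compared within a block, so the number of pairwise comparisons depends on block sizes rather than on the full database. Because no block needs information from another, the per-block loop is the natural place to distribute work across processes.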
