Paper Information - Random Forest DBSCAN for USPTO Inventor Name Disambiguation

Random Forest DBSCAN for USPTO Inventor Name Disambiguation

Name disambiguation and the subsequent name conflation are essential for correctly processing person name queries in a digital library or other database. Disambiguation distinguishes each unique person from all others in the database. We study inventor name disambiguation for a patent database using methods and features from earlier work on author name disambiguation, and we propose a feature set appropriate for a patent database. A random forest was selected as the pairwise linking classifier because it outperformed Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Conditional Inference Trees, and Decision Trees. Blocking size, which is very important for scaling, was selected based on experiments that determined feature importance and accuracy. The DBSCAN algorithm is used to cluster records, with a distance function derived from the random forest classifier. For additional scalability, clustering was parallelized. Tests on the USPTO patent database show that our method successfully disambiguated 12 million inventor mentions within 6.5 hours. Evaluation on datasets from the USPTO PatentsView inventor name disambiguation competition shows that our algorithm outperforms all algorithms in the competition.
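The pipeline described in the abstract (blocking, a pairwise classifier, then DBSCAN over a classifier-derived distance) can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption for illustration: the blocking key (last name plus first initial), the record fields, the `eps` and `min_pts` values, and above all `same_person_probability`, which is a hand-written stand-in for the random forest's predicted probability rather than the authors' model or feature set.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    """Blocking key (assumed): last name plus first initial."""
    return (record["last"].lower(), record["first"][:1].lower())


def same_person_probability(a, b):
    """Stand-in for the pairwise classifier's P(same inventor).

    The paper uses a random forest; this hand-written score only keeps
    the sketch runnable without external libraries.
    """
    name_sim = SequenceMatcher(None, a["first"].lower(), b["first"].lower()).ratio()
    city_sim = 1.0 if a["city"].lower() == b["city"].lower() else 0.0
    return 0.5 * name_sim + 0.5 * city_sim


def dbscan(n, distance, eps, min_pts):
    """Minimal DBSCAN over indices 0..n-1; returns labels, -1 marks noise."""
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        return [j for j in range(n) if distance(i, j) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def disambiguate(records, eps=0.4, min_pts=1):
    """Assign an inventor id to each record; eps and min_pts are assumed values."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    inventor_ids = {}
    next_id = 0
    for members in blocks.values():
        def distance(i, j):
            # Distance derived from the classifier: 1 - P(same inventor).
            return 1.0 - same_person_probability(records[members[i]], records[members[j]])

        labels = dbscan(len(members), distance, eps, min_pts)
        local = {}
        for pos, label in enumerate(labels):
            if label == -1 or label not in local:
                new_id, next_id = next_id, next_id + 1
                if label != -1:
                    local[label] = new_id
            else:
                new_id = local[label]
            inventor_ids[members[pos]] = new_id
    return [inventor_ids[i] for i in range(len(records))]


records = [
    {"first": "John", "last": "Smith", "city": "Boston"},
    {"first": "John A.", "last": "Smith", "city": "Boston"},
    {"first": "Jane", "last": "Smith", "city": "Austin"},
    {"first": "Wei", "last": "Chen", "city": "San Jose"},
]
ids = disambiguate(records)
print(ids)
```

Because pairs are only compared inside a block, each block can be clustered independently, which is what makes parallelizing the clustering step straightforward. In this toy data the two "John Smith" records from Boston end up with the same id, while "Jane Smith" and "Wei Chen" each get their own.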

Madian Khabsa | C. Lee Giles | Kunho Kim
