Author Name Disambiguation in Technology Trend Analysis Using SVM and Random Forests and Novel Topic Based Features

Technology trend analysis systems use data mining to process vast numbers of papers, patents, and news articles in order to analyze and predict the life cycles of technologies, products, and other kinds of entities. Some systems can also extract relations between entities such as technologies, authors, and products. To establish precise relations between entities, entity disambiguation has to be performed. In this study, we focus on author disambiguation in the context of technology trend analysis. We used Random Forests and SVM to learn a pairwise similarity function that decides whether two articles were written by the same author. Besides comparing common features such as article titles and author affiliations, we also studied features built from the analyses produced by KISTI's InSciTe system. For training and evaluation, a corpus containing 24,750 pairwise article similarities was manually constructed using data from InSciTe. On this corpus, Random Forests outperformed SVM and reached an accuracy of 98.31%. Using only the newly introduced features, an accuracy of 94.79% was achieved, which indicates that these features are useful on their own.
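To make the pairwise formulation concrete, the following minimal sketch shows how two article records might be turned into a feature vector and a same-author label, which a classifier such as Random Forests or SVM could then be trained on. The record fields (`title`, `affiliation`, `topics`, `author_id`), the choice of Jaccard similarity, and the sample data are all assumptions made for illustration; the abstract does not specify the actual InSciTe-derived features or how they were computed.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections; 0.0 when both are empty."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def pair_features(x, y):
    """Feature vector for one article pair (hypothetical feature set)."""
    return [
        # Overlap of lowercased title words.
        jaccard(x["title"].lower().split(), y["title"].lower().split()),
        # Exact affiliation match, case-insensitive.
        1.0 if x["affiliation"].lower() == y["affiliation"].lower() else 0.0,
        # Overlap of topic labels, standing in for topic-based features.
        jaccard(x["topics"], y["topics"]),
    ]


def build_pairs(articles):
    """Return (features, label) for every unordered pair of articles.

    label is 1 when both articles carry the same gold author_id, else 0.
    """
    rows = []
    for x, y in combinations(articles, 2):
        label = 1 if x["author_id"] == y["author_id"] else 0
        rows.append((pair_features(x, y), label))
    return rows


# Invented sample records sharing one ambiguous author name.
articles = [
    {"title": "Topic models for patents", "affiliation": "KISTI",
     "topics": ["topic model", "patent"], "author_id": "a1"},
    {"title": "Patent topic models revisited", "affiliation": "KISTI",
     "topics": ["topic model", "patent mining"], "author_id": "a1"},
    {"title": "Protein folding kinetics", "affiliation": "Other Univ",
     "topics": ["protein"], "author_id": "a2"},
]

pairs = build_pairs(articles)
```

Each element of `pairs` is one training example: the feature list is the classifier input and the label is the target. Treating disambiguation as binary classification over pairs is what allows standard learners to be applied directly.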
