Semantic fingerprints-based author name disambiguation in Chinese documents

Author name disambiguation is an important problem in bibliometric analysis and tech mining. Many techniques have been proposed; however, most of them require long run times or additional information. We present a new method based on semantic fingerprints that disambiguates author names without external data. A manually annotated dataset was built to test the effectiveness of the method. Experiments using co-author features, institution features, and text fingerprints were conducted separately. We found that the first two methods had higher precision but low recall, whereas the text fingerprint method had higher recall and satisfactory precision. Based on these results, we integrated co-author features, institution features, and text fingerprints into semantic fingerprints for disambiguating author names, which achieved better performance on the F-measure.
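
As a rough illustration of how the three kinds of evidence might be combined, the following Python sketch computes a SimHash-style text fingerprint and falls back on it when co-author and institution evidence is absent. It is a minimal sketch under stated assumptions, not the method evaluated in this work: the `Record` type, the `same_author` decision rule, the `max_hamming` threshold, and the use of MD5 as the token hash are all hypothetical choices made for the example, and the tokens are assumed to come from a Chinese word segmenter that is not shown.

```python
import hashlib
from dataclasses import dataclass


def simhash(tokens, bits=64):
    """SimHash-style fingerprint: each token votes +1/-1 on every bit position."""
    counts = [0] * bits
    for tok in tokens:
        # Take the first 8 bytes of MD5 as a 64-bit token hash (illustrative choice).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive overall.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


@dataclass
class Record:
    """Hypothetical container for one paper by an ambiguous author name."""
    coauthors: frozenset
    institution: str
    tokens: list  # pre-segmented words from the title/abstract (assumed)


def same_author(r1, r2, max_hamming=10):
    """Hypothetical rule: trust high-precision features first, then the text."""
    if r1.coauthors & r2.coauthors:
        return True  # shared co-author
    if r1.institution and r1.institution == r2.institution:
        return True  # same institution
    # Fall back on the higher-recall text fingerprint.
    return hamming(simhash(r1.tokens), simhash(r2.tokens)) <= max_hamming
```

The ordering reflects the reported trade-off: co-author and institution matches are checked first because they are more precise, and the text fingerprint is consulted only when neither applies, which is where it can recover recall. A real system would need to tune the threshold on annotated data.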
