Web Person Name Disambiguation by Relevance Weighting of Extended Feature Sets

This paper describes our approach to the Person Name Disambiguation clustering task in the Third Web People Search Evaluation Campaign (WePS-3). The method focuses on two aspects: extended feature sets and feature relevance weighting. Bag-of-words and named entities are the most commonly used features in existing web entity disambiguation algorithms, and we extend this basic feature set with Wikipedia concepts. Two feature weighting models are then employed: one measures the relevance of a feature to the target person name (or "query name"), and the other measures the relevance of a feature to the text content. A similarity score is calculated from the feature weights and used to cluster documents that refer to the same person. Experiments show that the system based on our approach produced the best results among all WePS-3 submissions.
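The pipeline outlined above (weighted features, a similarity score, then clustering) can be illustrated with a minimal sketch. The abstract does not say how the two relevance weights are combined, which similarity measure is used, or which clustering algorithm is applied, so everything here is an assumption made for illustration: the weights are multiplied together, similarity is cosine similarity, and documents are grouped by a simple threshold-based single-link merge. The names `weighted_vector`, `cosine`, `cluster`, `name_relevance`, and `text_relevance` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def weighted_vector(features, name_relevance=None, text_relevance=None):
    """Build a sparse feature vector for one document.

    `features` is a list of tokens, named entities, or Wikipedia concepts.
    Each count is scaled by two relevance weights (assumed multiplicative);
    a feature missing from a weight table defaults to 1.0.
    """
    name_relevance = name_relevance or {}
    text_relevance = text_relevance or {}
    counts = Counter(features)
    return {
        f: c * name_relevance.get(f, 1.0) * text_relevance.get(f, 1.0)
        for f, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(vectors, threshold):
    """Group documents whose pairwise similarity reaches `threshold`.

    Uses union-find, so any chain of sufficiently similar documents
    ends up in one cluster (single-link behaviour). Returns a sorted
    list of clusters, each a list of document indices.
    """
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

As a usage example, three toy documents for the same query name could be passed through `weighted_vector` and then `cluster(vectors, threshold=0.5)`; two pages sharing music-related features would merge while a page with only legal terms would stay separate. The threshold value is likewise an illustrative choice, not one reported in the paper.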
