MC4WEPS: a multilingual corpus for Web people search disambiguation

This article introduces the MC4WEPS corpus, a new resource for evaluating Web people search disambiguation. It describes the design of the corpus, its collection and annotation process, and the agreement between annotators, and it reports a baseline evaluation. The corpus is built by compiling multilingual search engine results for queries that are person names. Proper noun disambiguation is an open problem in natural language ambiguity resolution, and resolving the ambiguity of person names in Web search results in particular remains challenging. State-of-the-art approaches, however, have been evaluated only on monolingual web page collections. MC4WEPS aims to provide the research community with a reference corpus for disambiguating search engine results where the query is a person name shared by different individuals. Two features distinguish it from existing corpora for the same task: multilingualism and the inclusion of social networking websites. These characteristics make it more representative of a real search scenario, especially for evaluating person name disambiguation in a multilingual context. The article also gives detailed information about the format and availability of the corpus.
