Related Entity Finding Using Semantic Clustering Based on Wikipedia Categories

We present a system for Related Entity Finding, a form of Question Answering that exploits semantic information from the WWW and returns URIs as answers. Our system uses a search engine to gather all candidate answer entities and then applies a linear combination of Information Retrieval measures to choose the most relevant ones. For each selected candidate, we look up its Wikipedia page and construct a novel vector representation based on the tokenization of its Wikipedia category names. This representation lets our system compute a measure of semantic relatedness between entities even if they do not share any common category. We use this property to perform a semantic clustering of the candidate entities and show that the largest cluster contains entities that are closely related semantically and can be considered answers to the query. Performance measured on 20 topics from the 2009 TREC Related Entity Finding task shows competitive results.
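The idea of comparing entities through tokenized category names can be illustrated with a minimal sketch. It is not the system described here: the raw token counts, the cosine similarity, the single-link grouping with a fixed `threshold`, and the entity and category names in `candidates` are all assumptions chosen for illustration. The abstract does not specify the term weighting or the clustering algorithm.

```python
import math
import re
from collections import Counter


def category_vector(categories):
    """Build a bag-of-tokens vector from a list of category names."""
    tokens = []
    for name in categories:
        # Lowercase and split on non-alphanumeric characters (assumed tokenizer).
        tokens.extend(re.findall(r"[a-z0-9]+", name.lower()))
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def largest_cluster(entities, threshold=0.3):
    """Group entities whose similarity chain stays above `threshold`
    (single-link, an assumed stand-in) and return the biggest group."""
    names = list(entities)
    vectors = {n: category_vector(entities[n]) for n in names}
    unvisited = set(names)
    clusters = []
    for start in names:
        if start not in unvisited:
            continue
        unvisited.discard(start)
        cluster, stack = [start], [start]
        while stack:
            current = stack.pop()
            linked = [o for o in names
                      if o in unvisited
                      and cosine(vectors[current], vectors[o]) >= threshold]
            for other in linked:
                unvisited.discard(other)
                cluster.append(other)
                stack.append(other)
        clusters.append(cluster)
    return max(clusters, key=len)


# Hypothetical candidates: no two entities share a complete category name.
candidates = {
    "Airline A": ["Airlines of Germany", "Star Alliance members"],
    "Airline B": ["Airlines of France", "SkyTeam members"],
    "Airline C": ["Airlines of Japan", "Oneworld members"],
    "Composer X": ["German composers", "Baroque composers"],
}

print(largest_cluster(candidates))
```

In this toy input the three airlines have no category in common, yet tokens such as "airlines" and "members" give them nonzero similarity, while the composer shares no token with any of them. A realistic version would likely need stop-word handling or term weighting, since uninformative tokens such as "of" also contribute to the raw counts used here.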
