Computing inter-document similarity with Context Semantic Analysis

Abstract We propose Context Semantic Analysis (CSA), a novel knowledge-based technique for computing inter-document similarity. Several specialized approaches built on top of a specific knowledge base (e.g., Wikipedia) exist in the literature, but CSA differs from them in that it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g., DBpedia or Wikidata) to extract a Semantic Context Vector, a novel model for representing the context of a document, which CSA then uses to compute inter-document similarity. Moreover, we show how CSA can be applied in the Information Retrieval domain. Experimental results show that: (i) for the general task of inter-document similarity, CSA outperforms baselines built on traditional methods and achieves performance similar to that of approaches built on specific knowledge bases; (ii) for Information Retrieval tasks, enriching documents with context (i.e., employing the Semantic Context Vector model) improves the result quality of the state-of-the-art technique that employs a similar semantic enrichment.
