Sentence Similarity by Combining Explicit Semantic Analysis and Overlapping N-Grams

We propose a sentence similarity measure that combines a knowledge-based measure, a lighter version of Explicit Semantic Analysis (ESA), with a distributional measure, ROUGE. We applied this hybrid measure to two French domain-oriented corpora collected from the Web and compared its similarity scores with those of human judges. In both domains, ESA and ROUGE perform better combined than either does individually. Moreover, using the whole of Wikipedia in ESA did not prove necessary, since the best results were obtained with a small number of well-selected concepts.
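As a rough illustration of the idea, the sketch below mixes a concept-vector cosine (in the spirit of ESA) with a ROUGE-N-style n-gram overlap. The abstract does not say how the two scores are combined, so the linear interpolation, the weight `alpha`, the symmetrised overlap, and the toy `concept_index` mapping words to weighted concepts are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """ROUGE-N-style recall: clipped n-gram matches over reference n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def concept_vector(tokens, concept_index):
    """Sum the concept weights of every token (hypothetical ESA-like index)."""
    vec = Counter()
    for tok in tokens:
        for concept, weight in concept_index.get(tok, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def hybrid_similarity(s1, s2, concept_index, alpha=0.5, n=1):
    """Assumed combination: alpha * concept cosine + (1 - alpha) * n-gram overlap."""
    esa = cosine(concept_vector(s1, concept_index), concept_vector(s2, concept_index))
    # Average both directions so the overlap score is symmetric (an assumption).
    overlap = 0.5 * (ngram_overlap(s1, s2, n) + ngram_overlap(s2, s1, n))
    return alpha * esa + (1 - alpha) * overlap
```

With a small, hand-picked `concept_index`, two sentences that share few surface words can still score well through their shared concepts, which is the kind of complementarity the abstract reports.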
