Sentence Similarity Computational Model Based on Information Content

Sentence similarity computation is an increasingly important task in natural language processing applications such as information retrieval, machine translation, and text summarization. From the viewpoint of information theory, natural language is essentially a carrier of information, and the amount of information it carries can be measured by information content, a quantity that has already been used successfully, and in simple ways, for word similarity computation. Existing sentence similarity methods do not emphasize the information contained in the sentence, and the complicated models they employ often require empirical or trained parameters. This paper presents a fully unsupervised computational model of sentence semantic similarity. The model is simple and straightforward: it neither needs any empirical parameter nor relies on other NLP tools. It achieves state-of-the-art experimental results, which show that sentence similarity evaluated by the model is closer to human judgment than that of multiple competing baselines. The paper also examines how the proposed model is affected by the external corpus, how it performs with semantic nets of various sizes, and how efficiency relates to accuracy.

Key words: sentence semantic similarity, information content, inclusion-exclusion principle, natural language processing, information retrieval
