Assessing sentence similarity through lexical, syntactic and semantic analysis

Abstract: Sentence similarity methods assess the degree of similarity between sentences. They play an important role in areas such as text summarization, search, text categorization, and machine translation. Current methods for assessing sentence similarity are based only on the similarity between the words in the sentences. Such methods either represent sentences as bag-of-words vectors or are restricted to the syntactic information of the sentences. These strategies leave two important problems in language understanding unaddressed: word order and the meaning of the sentence as a whole. The sentence similarity measure presented here substantially improves and refines a recently published method that takes into account the lexical, syntactic, and semantic components of sentences. The new method was benchmarked on the Li–McLean dataset, where it outperforms state-of-the-art systems and achieves results comparable to human evaluation. In addition, the proposed method was extensively tested on the SemEval 2012 sentence similarity test set and in evaluating the degree of similarity between summaries using the CNN-corpus. In both cases, the proposed measure proved effective and useful.
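The word-order limitation of bag-of-words representations mentioned in the abstract can be illustrated with a minimal sketch. This is not the method proposed in the paper; the function name `bow_cosine`, the whitespace tokenization, and the example sentences are assumptions made purely for illustration.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two sentences.

    Illustrative only: tokens are lowercased and split on whitespace,
    which is an assumption and not the preprocessing used in the paper.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two sentences share.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    # Product of the two vector norms, computed under a single square root.
    norm = math.sqrt(
        sum(c * c for c in va.values()) * sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Same words, opposite meaning: the count vectors are identical,
# so the two sentences are scored as maximally similar.
s1 = "the dog bit the man"
s2 = "the man bit the dog"
print(bow_cosine(s1, s2))
```

Because both sentences contain exactly the same words with the same counts, a purely lexical measure of this kind cannot tell them apart, which is the gap that syntactic and semantic analysis is meant to close.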
