Semantic text similarity using corpus-based word similarity and string similarity

We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing methods for computing text similarity have focused mainly on either large documents or individual words. We focus on computing the similarity between two sentences or two short paragraphs. The proposed method can be exploited in a variety of applications involving textual knowledge representation and knowledge discovery. Evaluation results on two different data sets show that our method outperforms several competing methods.
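
The string-similarity component mentioned above can be illustrated with a minimal sketch of a normalized LCS score. The abstract does not state the normalization or the modifications the method uses, so the formula below (squared LCS length divided by the product of the two string lengths) is an assumption chosen for illustration, and the function names `lcs_length` and `normalized_lcs` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """Score in [0, 1]. ASSUMED normalization: len(LCS)^2 / (len(a) * len(b))."""
    if not a or not b:
        return 0.0
    common = lcs_length(a, b)
    return (common * common) / (len(a) * len(b))
```

Under this assumed normalization, identical strings score 1.0 and strings with no characters in common score 0.0. A full text-similarity measure along the lines the abstract describes would combine such a string score with a corpus-based word-similarity score, which this sketch does not attempt.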
