Sentence similarity based on semantic nets and corpus statistics

Sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes into account both the semantic information and the word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common-sense knowledge, and the incorporation of corpus statistics allows our method to be adapted to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that correlates significantly with human intuition.
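The idea of combining a semantic score with a word order score can be illustrated with a minimal Python sketch. It is not the paper's implementation: the table `TOY_WORD_SIM` is a hypothetical stand-in for the lexical database, corpus-statistics weighting is omitted entirely, and the function names, the `threshold` parameter, and the mixing weight `delta` are assumptions chosen for illustration.

```python
import math

# Hypothetical word-pair similarities standing in for a lexical database lookup.
TOY_WORD_SIM = {frozenset(("dog", "hound")): 0.8}


def word_sim(a, b):
    """Toy word similarity: 1.0 for identical words, table lookup otherwise."""
    if a == b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((a, b)), 0.0)


def semantic_vector(words, joint):
    """For each joint-set word, the best similarity to any word in the sentence."""
    return [max((word_sim(j, w) for w in words), default=0.0) for j in joint]


def order_vector(words, joint, threshold=0.4):
    """For each joint-set word, the 1-based position of its most similar word
    in the sentence, or 0 if nothing reaches the (assumed) threshold."""
    vec = []
    for j in joint:
        best_pos, best = 0, 0.0
        for i, w in enumerate(words, start=1):
            s = word_sim(j, w)
            if s > best:
                best_pos, best = i, s
        vec.append(best_pos if best >= threshold else 0)
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def order_sim(r1, r2):
    """Normalized difference of the two word order vectors."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


def sentence_similarity(s1, s2, delta=0.85):
    """Weighted mix of semantic and word order similarity.
    delta is an illustrative weight, not a value taken from the text above."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    joint = list(dict.fromkeys(w1 + w2))  # distinct words, first-seen order
    sem = cosine(semantic_vector(w1, joint), semantic_vector(w2, joint))
    order = order_sim(order_vector(w1, joint), order_vector(w2, joint))
    return delta * sem + (1 - delta) * order


print(sentence_similarity("dog bites man", "man bites dog"))
print(sentence_similarity("a dog barked", "a hound barked"))
```

In this sketch, two sentences with the same words in a different order receive a perfect semantic score but a reduced word order score, so the combined value falls below that of identical sentences. A real system would replace `word_sim` with a measure derived from a lexical database and weight words using corpus statistics.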
