A short text modeling method combining semantic and statistical information

This paper presents a novel method for modeling a collection of short text snippets in order to measure the similarity between pairs of snippets. The method takes into account both the semantic and the statistical information in the snippets and consists of three steps. Given a set of raw short text snippets, it first establishes an initial similarity between words using a lexical database. It then iteratively calculates both word similarity and short text similarity. Finally, it constructs a proximity matrix from the word similarities and uses this matrix to convert the raw text snippets into vectors. Word similarity and text clustering experiments show that the proposed short text modeling method improves the performance of existing text-related information retrieval (IR) techniques.
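The three steps can be illustrated with a minimal sketch. The abstract does not give the update formulas, so everything specific below is an assumption made for illustration: the hand-written initial similarities stand in for a lexical database lookup, the alternating update (text similarity from word similarity, then word similarity from text similarity, blended with the initial values by a hypothetical weight `alpha`) is one plausible form of the iteration, and the function names are invented. It should not be read as the authors' algorithm.

```python
import math

def bag_of_words(snippets, vocab):
    """Term-count matrix: one row per snippet, one column per vocabulary word."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for s in snippets:
        row = [0.0] * len(vocab)
        for tok in s.lower().split():
            if tok in index:
                row[index[tok]] += 1.0
        rows.append(row)
    return rows

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(c) for c in zip(*m)]

def normalize(m):
    """Scale a square matrix so that m[i][i] becomes 1 (cosine-style)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.sqrt(m[i][i] * m[j][j])
            out[i][j] = m[i][j] / d if d > 0 else 0.0
    return out

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def model_snippets(snippets, vocab, initial_sim, iterations=3, alpha=0.5):
    """Return (snippet vectors, word proximity matrix).

    initial_sim: step 1, word-by-word similarity (assumed to come from a
    lexical database; supplied by hand here).
    alpha: hypothetical weight that keeps the update anchored to initial_sim.
    """
    x = bag_of_words(snippets, vocab)
    xt = transpose(x)
    word_sim = [row[:] for row in initial_sim]
    n = len(vocab)
    for _ in range(iterations):
        # Step 2a: text similarity induced by the current word similarity.
        text_sim = normalize(matmul(matmul(x, word_sim), xt))
        # Step 2b: word similarity induced by the current text similarity.
        induced = normalize(matmul(matmul(xt, text_sim), x))
        word_sim = [[alpha * initial_sim[i][j] + (1 - alpha) * induced[i][j]
                     for j in range(n)] for i in range(n)]
    # Step 3: use the word similarity as a proximity matrix to map snippets.
    return matmul(x, word_sim), word_sim

vocab = ["car", "engine", "automobile", "motor", "banana", "fruit"]
snippets = ["car engine", "automobile motor", "banana fruit"]

# Hand-written stand-in for lexical-database similarities (assumed values).
initial = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
initial[0][2] = initial[2][0] = 0.9   # car ~ automobile
initial[1][3] = initial[3][1] = 0.8   # engine ~ motor

vectors, proximity = model_snippets(snippets, vocab, initial)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy input the first two snippets share no words, so their raw bag-of-words cosine is zero; after mapping through the proximity matrix they become more similar to each other than either is to the third snippet, which is the effect the abstract attributes to combining semantic and statistical information.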
