Short Text Classification Based on Semantics

Data sparseness and unseen words are two major problems in short text classification. The vector space model (VSM), which represents a text by the statistical occurrence of its terms, is therefore ill-suited to short texts. To address these problems, we present a novel short text classification method based on semantics; the method is carried out using K-Means. In the experiments, we use continuous word embeddings trained on very large unrelated corpora to represent semantic relationships. Experimental results on an open dataset show that the use of semantics greatly improves short text classification performance compared with a state-of-the-art VSM baseline, and that the proposed method can reduce the cost of collecting training data.
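
The abstract does not spell out how K-Means and the word embeddings are combined. The sketch below illustrates one plausible reading only, and it is not the authors' implementation: each short text is represented by the average of its word vectors, the texts are clustered with K-Means, and a new text is assigned to the nearest centroid. The toy 2-D vectors in `EMBEDDINGS`, the helper names `embed`, `nearest`, and `kmeans`, and the deterministic initialization are all assumptions made for illustration; a real system would load pretrained embeddings.

```python
import math

# Hypothetical 2-D toy embeddings (assumption). Real embeddings would be
# pretrained on a large corpus and have hundreds of dimensions.
EMBEDDINGS = {
    "goal": (0.9, 0.1), "match": (0.8, 0.2), "striker": (1.0, 0.0),
    "stock": (0.1, 0.9), "market": (0.2, 0.8), "shares": (0.0, 1.0),
}

def embed(text):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iters=20):
    """Plain K-Means, initialized from the first k points (assumption)."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster ends up empty.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

texts = ["goal match", "stock market", "striker goal", "shares stock"]
centroids = kmeans([embed(t) for t in texts], k=2)

# Texts sharing no words can still land in the same cluster, because the
# comparison happens in embedding space rather than on term overlap.
sports = nearest(embed("striker match"), centroids)
finance = nearest(embed("shares market"), centroids)
```

Here "striker match" and "goal" share no terms, so a term-occurrence representation would see them as unrelated, whereas their averaged embeddings lie close together. A word missing from `EMBEDDINGS` is simply skipped in this sketch.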
