N-gram-Based Low-Dimensional Representation for Document Classification

The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as features for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower-dimensional space with minimal loss of semantic information. Some semantic information is nevertheless always lost, since only words are considered. Instead, we aim to use information from n-grams to overcome this limitation, while remaining in a low-dimensional space. Many approaches, such as the Skip-gram model, provide good word vector representations very quickly. We propose to average these representations to obtain representations of n-grams. All n-grams are thus embedded in the same semantic space. K-means clustering can then group them into semantic concepts. The number of features is therefore dramatically reduced, and documents can be represented as bags of semantic concepts. We show that this model outperforms LSA and LDA on a sentiment classification task, and yields results similar to those of a traditional BOW model with far fewer features.
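
The pipeline sketched in the abstract (embed each n-gram by averaging pre-trained word vectors, cluster the n-gram vectors with K-means, and represent each document as a histogram of cluster assignments) can be illustrated in a few lines. The sketch below is not the authors' implementation: the toy vocabulary, the random vectors standing in for real Skip-gram embeddings, and the helper names ngram_vectors and bag_of_concepts are assumptions made for this example, with scikit-learn's KMeans used for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for pre-trained word vectors (e.g. from Skip-gram);
# in practice, load real embeddings instead.
vocab = ["good", "great", "bad", "awful", "movie", "film", "plot", "boring"]
embeddings = {w: rng.normal(size=50) for w in vocab}

def ngram_vectors(tokens, n, embeddings):
    """Embed each n-gram in a token list as the average of its word vectors."""
    vecs = []
    for i in range(len(tokens) - n + 1):
        words = [w for w in tokens[i:i + n] if w in embeddings]
        if words:
            vecs.append(np.mean([embeddings[w] for w in words], axis=0))
    return vecs

docs = [
    "good movie great plot".split(),
    "awful film boring plot".split(),
]

# Collect all n-gram embeddings (bigrams here) across the corpus,
# then cluster them into K "semantic concepts".
all_ngrams = [v for doc in docs for v in ngram_vectors(doc, 2, embeddings)]
K = 3
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(all_ngrams))

def bag_of_concepts(tokens, n, embeddings, kmeans, K):
    """Represent a document as a histogram over the K concept clusters."""
    counts = np.zeros(K)
    vecs = ngram_vectors(tokens, n, embeddings)
    if vecs:
        for label in kmeans.predict(np.vstack(vecs)):
            counts[label] += 1
    return counts

features = np.array([bag_of_concepts(d, 2, embeddings, kmeans, K) for d in docs])
print(features)  # one K-dimensional row per document
```

The resulting K-dimensional feature vectors replace the full n-gram vocabulary and can be fed to any standard linear classifier; K is chosen far smaller than the number of distinct n-grams, which is what yields the dimensionality reduction.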
