Improving Document Classification with Multi-Sense Embeddings

Efficient representation of text documents is an important building block in many NLP tasks. Research on long-text categorization has shown that simple weighted averaging of word vectors for sentence representation often outperforms more sophisticated neural models. The recently proposed Sparse Composite Document Vector (SCDV) (Mekala et al., 2017) extends this approach from sentences to documents by applying soft clustering over word vectors. However, SCDV disregards the multi-sense nature of words, and it also suffers from the curse of dimensionality because of its high-dimensional representations. In this work, we address these shortcomings and propose SCDV-MS, which utilizes multi-sense word embeddings and learns a lower-dimensional manifold. Through extensive experiments on multiple real-world datasets, we show that SCDV-MS embeddings outperform previous state-of-the-art embeddings on multi-class and multi-label text categorization tasks. Furthermore, SCDV-MS embeddings are more efficient than SCDV in terms of both time and space complexity on textual classification tasks.
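To make the construction concrete, below is a minimal sketch of the SCDV pipeline the abstract describes: soft-cluster pre-trained word vectors with a Gaussian mixture model, form idf-weighted word-cluster vectors, average them per document, and hard-threshold small entries to induce sparsity. The names (build_scdv, n_clusters, sparsity) are illustrative assumptions, not taken from the authors' released code, and the threshold is a simplified stand-in for the paper's percentage-based cutoff.

```python
# Sketch of SCDV-style document vectors, assuming pre-trained word
# vectors and idf weights are available. Illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_scdv(docs, word_vecs, idf, n_clusters=3, sparsity=0.04):
    """docs: list of token lists; word_vecs: dict word -> (d,) array;
    idf: dict word -> float. Returns an (n_docs, n_clusters * d) matrix."""
    vocab = sorted(word_vecs)
    V = np.stack([word_vecs[w] for w in vocab])          # (|V|, d)
    # Soft-cluster the word vectors with a Gaussian mixture model.
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(V)
    P = gmm.predict_proba(V)                             # (|V|, K) soft assignments
    # Word-cluster vector: idf(w) * concat_k( P(k|w) * v_w ), length K*d.
    wcv = {w: idf[w] * (P[i][:, None] * V[i]).ravel()
           for i, w in enumerate(vocab)}
    # Average word-cluster vectors over each document (each doc is
    # assumed to contain at least one in-vocabulary token).
    doc_mat = np.stack([
        np.mean([wcv[w] for w in doc if w in wcv], axis=0) for doc in docs
    ])
    # Zero out small entries to obtain the sparse composite vectors.
    doc_mat[np.abs(doc_mat) < sparsity * np.abs(doc_mat).max()] = 0.0
    return doc_mat
```

Under this reading, SCDV-MS would additionally swap word_vecs for sense-disambiguated multi-sense embeddings before clustering, and project the resulting K*d-dimensional vectors onto a lower-dimensional manifold to curb the dimensionality blow-up.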

[1] Kevin Gimpel et al. From Paraphrase Database to Compositional Paraphrase Model and Back, 2015, Transactions of the Association for Computational Linguistics.

[2] Minmin Chen. Efficient Vector Representation for Documents through Corruption, 2017, ICLR.

[3] Paul Smolensky. Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems, 1990, Artificial Intelligence.

[4] Mirella Lapata et al. Composition in Distributional Models of Semantics, 2010, Cognitive Science.

[5] Prateek Jain et al. Sparse Local Embeddings for Extreme Multi-label Classification, 2015, NIPS.

[6] Lukáš Burget et al. Recurrent neural network based language model, 2010, INTERSPEECH.

[7] Jürgen Schmidhuber et al. Learning Precise Timing with LSTM Recurrent Networks, 2003, Journal of Machine Learning Research.

[8] Pascal Vincent et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010, Journal of Machine Learning Research.

[9] Yoon Kim. Convolutional Neural Networks for Sentence Classification, 2014, EMNLP.

[10] Sanjeev Arora et al. A Simple but Tough-to-Beat Baseline for Sentence Embeddings, 2017, ICLR.

[11] Daniel Jurafsky et al. Do Multi-Sense Embeddings Improve Natural Language Understanding?, 2015, EMNLP.

[12] Luke S. Zettlemoyer et al. Deep Contextualized Word Representations, 2018, NAACL.

[13] Zellig S. Harris. Distributional Structure, 1954.

[14] Zhiyuan Liu et al. Topical Word Embeddings, 2015, AAAI.

[15] Ting Wang et al. Improving Distributed Word Representation and Topic Model by Word-Topic Mixture Model, 2016, ACML.

[16] Bo Fu et al. Integrating Topic Modeling with Word Embeddings by Mixtures of vMFs, 2016, COLING.

[17] Rajarshi Das et al. Gaussian LDA for Topic Models with Word Embeddings, 2015, ACL.

[18] Iryna Gurevych et al. Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations, 2018, arXiv:1803.01400.

[19] H. Abdi et al. Principal component analysis, 2010.

[20] Partha Pratim Talukdar et al. Unsupervised Document Representation using Partition Word-Vectors Averaging, 2018.

[21] Quoc V. Le et al. Distributed Representations of Sentences and Documents, 2014, ICML.

[22] Junhua He et al. LTSG: Latent Topical Skip-Gram for Mutually Learning Topic Model and Vector Representations, 2018, PRCV.

[23] Sungzoon Cho et al. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation, 2017, Neurocomputing.

[24] Dimitris Achlioptas. Database-friendly random projections, 2001, PODS.

[25] David M. Blei et al. The Dynamic Embedded Topic Model, 2019, arXiv.

[26] Kevin Gimpel et al. Towards Universal Paraphrastic Sentence Embeddings, 2015, ICLR.

[27] Harish Karnick et al. Product Classification in E-Commerce using Distributional Semantics, 2016, COLING.

[28] Christopher E. Moody. Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec, 2016, arXiv.

[29] Hadi Amiri et al. Vector of Locally Aggregated Embeddings for Text Representation, 2019, NAACL.

[30] Geoffrey Zweig et al. Linguistic Regularities in Continuous Space Word Representations, 2013, NAACL.

[31] Christopher Potts et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.

[32] Dat Quoc Nguyen et al. Improving Topic Models with Latent Feature Word Representations, 2015, TACL.

[33] Dheeraj Mekala et al. SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations, 2017, EMNLP.

[34] Andrew Gordon Wilson et al. Probabilistic FastText for Multi-Sense Word Embeddings, 2018, ACL.

[35] Mirella Lapata et al. Vector-based Models of Semantic Composition, 2008, ACL.

[36] Jeffrey Dean et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[37] Chunyan Miao et al. Generative Topic Embedding: a Continuous Representation of Documents, 2016, ACL.

[38] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[39] Peng Jin et al. Bag-of-Embeddings for Text Classification, 2016, IJCAI.

[40] Xuanjing Huang et al. Learning Context-Sensitive Word Embeddings with Neural Tensor Skip-Gram Model, 2015, IJCAI.

[41] Amitabha Mukerjee et al. Words are not Equal: Graded Weighting Model for Building Composite Document Vectors, 2015, ICON.

[42] Phil Blunsom et al. A Convolutional Neural Network for Modelling Sentences, 2014, ACL.

[43] David M. Blei et al. Topic Modeling in Embedding Spaces, 2019, Transactions of the Association for Computational Linguistics.

[44] Xinyu Dai et al. Topic2Vec: Learning distributed representations of topics, 2015, International Conference on Asian Language Processing (IALP).

[45] Michael I. Jordan et al. Latent Dirichlet Allocation, 2003, Journal of Machine Learning Research.