Efficient Vector Representation for Documents through Corruption

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.

[1]  Yoshua Bengio,et al.  Marginalized Denoising Auto-encoders for Nonlinear Representations , 2014, ICML.

[2]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[3]  Xiang Zhang,et al.  Text Understanding from Scratch , 2015, ArXiv.

[4]  Christopher D. Manning,et al.  Baselines and Bigrams: Simple, Good Sentiment and Topic Classification , 2012, ACL.

[5]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[6]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[7]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[8]  SaltonGerard,et al.  Term-weighting approaches in automatic text retrieval , 1988 .

[9]  Xinyun Chen Under Review as a Conference Paper at Iclr 2017 Delving into Transferable Adversarial Ex- Amples and Black-box Attacks , 2016 .

[10]  Andrew Y. Ng,et al.  Improving Word Representations via Global Context and Multiple Word Prototypes , 2012, ACL.

[11]  Yoshua Bengio,et al.  Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach , 2011, ICML.

[12]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[13]  Kilian Q. Weinberger,et al.  Marginalized Denoising Autoencoders for Domain Adaptation , 2012, ICML.

[14]  Quoc V. Le,et al.  Grounded Compositional Semantics for Finding and Describing Images with Sentences , 2014, TACL.

[15]  Mirella Lapata,et al.  Composition in Distributional Models of Semantics , 2010, Cogn. Sci..

[16]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[17]  M. Marelli,et al.  SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment , 2014, *SEMEVAL.

[18]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[19]  Stephen Tyree,et al.  Learning with Marginalized Corrupted Features , 2013, ICML.

[20]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[21]  Quoc V. Le,et al.  Document Embedding with Paragraph Vectors , 2015, ArXiv.

[22]  Alexander M. Rush,et al.  Character-Aware Neural Language Models , 2015, AAAI.

[23]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[24]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[25]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[26]  Mehrnoosh Sadrzadeh,et al.  Multi-Step Regression Learning for Compositional Distributional Semantics , 2013, IWCS.

[27]  Christopher Potts,et al.  Learning Word Vectors for Sentiment Analysis , 2011, ACL.

[28]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[29]  Ioannis Korkontzelos,et al.  Estimating Linear Models for Compositional Distributional Semantics , 2010, COLING.

[30]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[31]  Geoffrey Zweig,et al.  Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.

[32]  Claire Cardie,et al.  Compositional Matrix-Space Models for Sentiment Analysis , 2011, EMNLP.

[33]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[34]  Sida I. Wang,et al.  Dropout Training as Adaptive Regularization , 2013, NIPS.

[35]  Quoc V. Le,et al.  Semi-supervised Sequence Learning , 2015, NIPS.

[36]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[37]  Marc'Aurelio Ranzato,et al.  Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews , 2014, ICLR.