Unsupervised Document Embedding via Contrastive Augmentation

We present a contrastive learning approach that uses data augmentation to learn document representations in an unsupervised manner. Inspired by recent contrastive self-supervised learning algorithms for image and NLP pretraining, we hypothesize that a high-quality document embedding should be invariant to diverse paraphrases that preserve the semantics of the original document. Across different backbones and contrastive learning frameworks, our study reveals substantial benefits of contrastive augmentation for document representation learning, along with two further insights: 1) incorporating data augmentation in a contrastive way substantially improves embedding quality in unsupervised document representation learning, and 2) in general, stochastic augmentations generated by simple word-level manipulation work much better than sentence-level and document-level ones. We plug our method into a classifier and compare it with a broad range of baseline methods on six benchmark datasets. Our method reduces the classification error rate by up to 6.4% relative to state-of-the-art approaches on the document classification task, matching or even surpassing fully supervised methods.
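The core recipe described above, generating two stochastically augmented views of each document and training the encoder so that their embeddings agree, can be sketched as follows. This is a minimal illustration assuming an NT-Xent (SimCLR-style) objective; the word-level augmentation, its probabilities, and the generic encoder stand-in are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch: word-level contrastive augmentation for document embeddings.
# `encode` stands in for any document encoder (e.g., a BERT-style backbone);
# all function names and hyperparameters here are illustrative assumptions.
import random
import torch
import torch.nn.functional as F

def word_level_augment(tokens, p_delete=0.1, p_swap=0.1):
    """Stochastic word-level manipulation: random deletions plus adjacent swaps."""
    out = [t for t in tokens if random.random() > p_delete] or tokens[:1]
    for i in range(len(out) - 1):
        if random.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss: two augmented views of the same document are
    positives; every other document in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2N, d) stacked views
    sim = z @ z.t() / temperature             # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))         # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage sketch: embed two augmented views of each document in a batch and
# minimize the contrastive loss (hypothetical `encode` function assumed).
# z1 = encode([word_level_augment(doc) for doc in batch])
# z2 = encode([word_level_augment(doc) for doc in batch])
# loss = nt_xent_loss(z1, z2)
```

The word-level view generator reflects the paper's finding that simple word-level manipulations outperform sentence- and document-level augmentations; any other augmentation could be substituted without changing the contrastive objective.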
