Semi-Supervised Sequence Modeling with Cross-View Training

Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.
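To make the training signal concrete, below is a minimal sketch of the CVT objective, assuming a PyTorch implementation. The module names, the choice of restricted views (forward-only vs. backward-only LSTM states), and the loss weighting are illustrative simplifications, not the authors' exact configuration.

```python
# Minimal sketch of the CVT training signal (not the authors' code).
# Assumes PyTorch; module names and restricted views are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVTTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared Bi-LSTM encoder: its representations are what CVT improves.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Primary prediction module sees the full bidirectional output.
        self.primary = nn.Linear(2 * hidden_dim, num_tags)
        # Auxiliary modules see restricted views: here, only the forward
        # or only the backward direction of the shared encoder.
        self.aux_fwd = nn.Linear(hidden_dim, num_tags)
        self.aux_bwd = nn.Linear(hidden_dim, num_tags)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))      # (B, T, 2H)
        h_fwd, h_bwd = h.chunk(2, dim=-1)            # each (B, T, H)
        return {
            "primary": self.primary(h),
            "aux_fwd": self.aux_fwd(h_fwd),
            "aux_bwd": self.aux_bwd(h_bwd),
        }

def cvt_step(model, labeled_tokens, labels, unlabeled_tokens):
    """One combined step: supervised loss on labeled data plus a
    consistency loss teaching the auxiliary modules to match the
    (frozen) primary predictions on unlabeled data."""
    # Standard supervised cross-entropy on the labeled minibatch.
    out = model(labeled_tokens)
    sup_loss = F.cross_entropy(out["primary"].flatten(0, 1),
                               labels.flatten())

    # Cross-view consistency on the unlabeled minibatch: the primary
    # module's soft predictions serve as fixed targets (no gradient),
    # and each restricted-view module is trained to match them.
    out_u = model(unlabeled_tokens)
    teacher = F.softmax(out_u["primary"], dim=-1).detach()
    cvt_loss = sum(
        F.kl_div(F.log_softmax(out_u[k], dim=-1), teacher,
                 reduction="batchmean")
        for k in ("aux_fwd", "aux_bwd")
    )
    return sup_loss + cvt_loss
```

Because the auxiliary modules read the same shared encoder states as the primary module, gradients from the consistency term flow back into the Bi-LSTM, which is how unlabeled data improves the full model.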
