Towards the ImageNet-CNN of NLP: Pretraining Sentence Encoders with Machine Translation

Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet. In contrast, deep models for language tasks currently transfer only unsupervised word vectors and initialize all higher layers randomly. In this paper, we use the encoder of an attentional sequence-to-sequence model trained for machine translation to initialize models for a variety of other language tasks. We show that this transfer improves performance over using word vectors alone on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (QC), entailment (SNLI), and question answering (SQuAD).
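To make the transfer recipe concrete, the following is a minimal sketch (in PyTorch, with illustrative names, layer sizes, and checkpoint path that are assumptions, not the paper's actual code) of the core idea: an encoder that would be trained inside an attentional sequence-to-sequence MT model is loaded into a downstream task model, whose classification head alone is randomly initialized.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional LSTM encoder, as would be trained inside an
    attentional sequence-to-sequence MT model (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> contextual states: (batch, seq_len, 2*hidden_dim)
        states, _ = self.rnn(self.embed(tokens))
        return states

class Classifier(nn.Module):
    """Downstream task model that reuses the MT-pretrained encoder and
    adds a randomly initialized classification head on top."""
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(2 * encoder.rnn.hidden_size, num_classes)

    def forward(self, tokens):
        states = self.encoder(tokens)
        pooled = states.mean(dim=1)  # simple mean pooling over time
        return self.head(pooled)

# Hypothetical usage: initialize the encoder from an MT-pretrained
# checkpoint, then fine-tune (or freeze) it on the target task.
encoder = Encoder(vocab_size=50000)
# encoder.load_state_dict(torch.load("mt_encoder.pt"))  # assumed checkpoint name
model = Classifier(encoder, num_classes=5)        # e.g. fine-grained sentiment
logits = model(torch.randint(0, 50000, (2, 12)))  # dummy batch of token ids
```

The pooling and head shown here are placeholders; the point is only that every encoder layer starts from MT-pretrained weights rather than random initialization, mirroring how vision models start from ImageNet-pretrained layers.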