SLM: Learning a Discourse Language Representation with Sentence Unshuffling

We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation in a fully self-supervised manner. Recent pre-training methods in NLP focus on learning either bottom-level or top-level language representations: contextualized word representations derived from language-model objectives at one extreme, and a whole-sequence representation learned by classifying the order of two given textual segments at the other. However, these models are not directly encouraged to capture representations of the intermediate-size structures that exist in natural language, such as sentences and the relationships among them. To that end, we propose a new approach that encourages learning a contextualized sentence-level representation by shuffling the sequence of input sentences and training a hierarchical transformer model to reconstruct the original ordering. Through experiments on downstream tasks such as GLUE, SQuAD, and DiscoEval, we show that this objective improves the performance of the original BERT by large margins.
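
To make the unshuffling objective concrete, the sketch below builds a shuffled-sentence training example and asks a small sentence-level transformer to predict each sentence's original position. This is a minimal PyTorch sketch under simplifying assumptions, not the paper's implementation: `SentenceUnshuffler`, `make_unshuffling_example`, the toy bag-of-words sentence encoder, and the position-classification head are hypothetical stand-ins for the hierarchical transformer described in the abstract.

```python
# Minimal, illustrative sketch of sentence unshuffling (hypothetical names,
# not the authors' code).
import random
import torch
import torch.nn as nn

class SentenceUnshuffler(nn.Module):
    def __init__(self, vocab_size=10000, dim=128, max_sents=8):
        super().__init__()
        # Toy bag-of-words sentence encoder (stand-in for a token-level transformer).
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        # Sentence-level transformer that contextualizes sentence vectors.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        # Classifies each shuffled sentence into one of `max_sents` original positions.
        self.head = nn.Linear(dim, max_sents)

    def forward(self, sentence_token_ids):
        # sentence_token_ids: list of LongTensors, one per (shuffled) sentence.
        sent_vecs = torch.stack([self.embed(ids.unsqueeze(0)).squeeze(0)
                                 for ids in sentence_token_ids])        # (num_sents, dim)
        contextual = self.context(sent_vecs.unsqueeze(0)).squeeze(0)    # (num_sents, dim)
        return self.head(contextual)                                    # (num_sents, max_sents)

def make_unshuffling_example(sentences):
    """Shuffle the sentences and return (shuffled, original position of each)."""
    order = list(range(len(sentences)))
    random.shuffle(order)
    shuffled = [sentences[i] for i in order]
    targets = torch.tensor(order)  # position each shuffled sentence came from
    return shuffled, targets

# Toy usage: three "sentences" given as random token-id tensors.
# (Documents longer than max_sents would need truncation or a larger head.)
sents = [torch.randint(0, 10000, (n,)) for n in (5, 7, 6)]
shuffled, targets = make_unshuffling_example(sents)
model = SentenceUnshuffler()
logits = model(shuffled)
loss = nn.functional.cross_entropy(logits, targets)  # reconstruct the original ordering
loss.backward()
```

Framing reconstruction as per-sentence position classification with a cross-entropy loss is just one simple way to express the target; the paper's hierarchical model may decode the ordering differently.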

[1] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[2] Honglak Lee, et al. Sentence Ordering and Coherence Modeling using Recurrent Neural Networks, 2016, AAAI.

[3] Xu Tan, et al. MASS: Masked Sequence to Sequence Pre-training for Language Generation, 2019, ICML.

[4] Sanja Fidler, et al. Skip-Thought Vectors, 2015, NIPS.

[5] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[6] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[7] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[8] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, ArXiv.

[9] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[10] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[11] Luo Si, et al. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding, 2019, ICLR.

[12] Ming-Wei Chang, et al. Language Model Pre-training for Hierarchical Document Representations, 2019, ArXiv.

[13] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[14] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).

[15] Dan Iter, et al. Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models, 2020, ACL.

[16] Samuel R. Bowman, et al. Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning, 2017, ArXiv.

[17] Xuanjing Huang, et al. Neural Sentence Ordering, 2016, ArXiv.

[18] Steven Bird, et al. NLTK: The Natural Language Toolkit, 2002, ACL.

[19] Kevin Gimpel, et al. Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations, 2019, EMNLP/IJCNLP.

[20] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.

[21] Percy Liang, et al. Know What You Don’t Know: Unanswerable Questions for SQuAD, 2018, ACL.

[22] Zhe Gan, et al. Learning Generic Sentence Representations Using Convolutional Neural Networks, 2016, EMNLP.

[23] Shibamouli Lahiri, et al. Complexity of Word Collocation Networks: A Preliminary Structural Analysis, 2013, EACL.

[24] Ming Zhou, et al. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization, 2019, ACL.

[25] Omer Levy, et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019, TACL.

[26] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[27] Felix Hill, et al. Learning Distributed Representations of Sentences from Unlabelled Data, 2016, NAACL.

[28] Hao Tian, et al. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, 2019, AAAI.

[29] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[30] Danqi Chen, et al. CoQA: A Conversational Question Answering Challenge, 2018, TACL.

[31] Xuanjing Huang, et al. End-to-End Neural Sentence Ordering Using Pointer Network, 2016, ArXiv.