ERNIE-Doc: A Retrospective Long-Document Modeling Transformer

Transformers are not suited to processing long documents because their memory and time consumption grow quadratically with sequence length. Simply truncating a long document or applying a sparse attention mechanism leads to the context fragmentation problem or to inferior modeling capability at a comparable model size. In this paper, we propose ERNIE-Doc, a document-level language pretraining model based on Recurrence Transformers (Dai et al., 2019). Two well-designed techniques, the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-Doc a much longer effective context length, allowing it to capture the contextual information of a whole document. We also pretrain ERNIE-Doc with an additional document-aware segment-reordering objective so that it explicitly learns the relationships among segments. Experiments are conducted on a variety of English and Chinese document-level tasks. ERNIE-Doc achieves a state-of-the-art language modeling result of 16.8 perplexity on WikiText-103 and outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification and question answering.
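To make the two mechanisms named above concrete, here is a minimal PyTorch sketch of the segment-level memory flow, not the paper's implementation. The helper names (ToyLayer, run_segment, d_model, the toy layer stack, and the random toy segments) are illustrative assumptions; only the memory-indexing logic (same-layer memory reuse for the enhanced recurrence, and a two-pass skimming/retrospective loop for the retrospective feed) reflects what the abstract describes.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seg_len = 64, 4, 4, 16  # toy sizes, not the paper's

class ToyLayer(nn.Module):
    """One self-attention block whose keys/values span [cached memory; current segment]."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x, mem):
        # Cached memory is prepended as extra keys/values; no gradient flows into it.
        ctx = x if mem is None else torch.cat([mem.detach(), x], dim=1)
        out, _ = self.attn(x, ctx, ctx)
        return self.ff(out)

layers = nn.ModuleList(ToyLayer() for _ in range(n_layers))

def run_segment(x, prev_hiddens, enhanced=True):
    """Process one segment and return all its hidden states for reuse as memory.

    hiddens[0] is the segment input; hiddens[n + 1] is the output of layer n.
    Transformer-XL-style recurrence feeds layer n the previous segment's
    hiddens[n] (one layer below), while the enhanced recurrence sketched here
    feeds it the previous segment's hiddens[n + 1] (the same layer), so the
    effective context length grows faster with depth.
    """
    hiddens, h = [x], x
    for n, layer in enumerate(layers):
        mem = None
        if prev_hiddens is not None:
            mem = prev_hiddens[n + 1] if enhanced else prev_hiddens[n]
        h = layer(h, mem)
        hiddens.append(h)
    return hiddens

# Retrospective feed: the document's segments are fed twice.  In the second
# (retrospective) pass, the cached memories already summarise the whole
# document from the first (skimming) pass, so every segment can draw on
# document-level context rather than only the segments preceding it.
document = [torch.randn(1, seg_len, d_model) for _ in range(3)]  # 3 toy segments
mems = None
for phase in ("skimming", "retrospective"):
    for segment in document:
        mems = run_segment(segment, mems, enhanced=True)
```

Setting enhanced=False in this sketch falls back to the Transformer-XL-style shift-by-one-layer memory, which is what bounds the effective context of vanilla recurrence; the same-layer variant and the second pass together are what allow a segment to see (an approximation of) the whole document.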

[1] Christopher Potts, et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.

[2] Benno Stein, et al. SemEval-2019 Task 4: Hyperpartisan News Detection, 2019, *SEMEVAL.

[3] Wanxiang Che, et al. Pre-Training with Whole Word Masking for Chinese BERT, 2019, ArXiv.

[4] Yuting Lai, et al. DRCD: a Chinese Machine Reading Comprehension Dataset, 2018, ArXiv.

[5] Zheng Zhang, et al. BP-Transformer: Modelling Long-Range Context via Binary Partitioning, 2019, ArXiv.

[6] Hao Tian, et al. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, 2019, AAAI.

[7] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.

[8] Wen Gao, et al. Segatron: Segment-Aware Transformer for Language Modeling and Understanding, 2020.

[9] Richard Socher, et al. An Analysis of Neural Language Modeling at Multiple Scales, 2018, ArXiv.

[10] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.

[11] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, ArXiv.

[12] Mausam, et al. A Simple Yet Strong Pipeline for HotpotQA, 2020, EMNLP.

[13] Timothy P. Lillicrap, et al. Compressive Transformers for Long-Range Sequence Modelling, 2019, ICLR.

[14] Alexei Baevski, et al. Adaptive Input Representations for Neural Language Modeling, 2018, ICLR.

[15] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.

[16] Dian Yu, et al. CLUE: A Chinese Language Understanding Evaluation Benchmark, 2020, COLING.

[17] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[18] Luchen Tan, et al. SegaBERT: Pre-training of Segment-aware BERT for Language Understanding, 2020, ArXiv.

[19] Chenyan Xiong, et al. Open Domain Web Keyphrase Extraction Beyond Language Modeling, 2019, EMNLP.

[20] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[21] Nicolas Usunier, et al. Improving Neural Language Models with a Continuous Cache, 2016, ICLR.

[22] Ming Zhou, et al. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization, 2019, ACL.

[23] Sebastian Riedel, et al. Constructing Datasets for Multi-hop Reading Comprehension Across Documents, 2017, TACL.

[24] Si Sun, et al. Joint Keyphrase Chunking and Salience Ranking with BERT, 2020, ArXiv.

[25] Jimmy J. Lin, et al. Pretrained Transformers for Text Ranking: BERT and Beyond, 2020, NAACL.

[26] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.

[27] Claire Cardie, et al. Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension, 2019, ArXiv.

[28] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[29] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[30] Christopher Clark, et al. Simple and Effective Multi-Paragraph Reading Comprehension, 2017, ACL.

[31] Ashish Vaswani, et al. Self-Attention with Relative Position Representations, 2018, NAACL.

[32] Yu Sun, et al. ERNIE: Enhanced Representation through Knowledge Integration, 2019, ArXiv.

[33] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[34] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.

[35] Quoc V. Le, et al. A Simple Method for Commonsense Reasoning, 2018, ArXiv.

[36] Yann Dauphin, et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[37] Liu Yang, et al. Sparse Sinkhorn Attention, 2020, ICML.

[38] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.

[39] Wentao Ma, et al. A Span-Extraction Dataset for Chinese Machine Reading Comprehension, 2019, EMNLP-IJCNLP.

[40] Xianpei Han, et al. CAIL2019-SCM: A Dataset of Similar Case Matching in Legal Domain, 2019, ArXiv.

[41] Li Yang, et al. ETC: Encoding Long and Structured Inputs in Transformers, 2020, EMNLP.

[42] Xinyan Xiao, et al. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications, 2017, QA@ACL.

[43] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[44] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.