LordBERT: Embedding Long Text by Segment Ordering with BERT

Although BERT has achieved significant improvements on many downstream NLP tasks, it has difficulty handling long text because of its quadratic computational complexity. A typical approach to this issue is to split the input into shorter segments and use an order-independent attention mechanism for inter-segment interaction, but this ignores segment order information, which is greatly beneficial for capturing implicit relations across different segments. To address this problem, we propose a novel multi-task learning framework, named LordBERT, which fully exploits both intra- and inter-segment information in long text by segment ordering with BERT. LordBERT learns segment-level representations from segments through BERT and a reasoner, and uses an auxiliary segment ordering module to restore the original order of shuffled segments. With this module, the model implicitly encodes inter-segment relations and global information about the long text into the segment representations. The downstream task and the ordering task are jointly optimized during training, while at inference time mainly the downstream task is performed. Experimental results show that LordBERT outperforms state-of-the-art models by up to 0.58% in accuracy on long-text classification tasks.
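The abstract describes the framework only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation: each segment is encoded by BERT, a small Transformer "reasoner" mixes the segment-level embeddings, a classification head handles the downstream task, and an auxiliary ordering head predicts the original position of each shuffled segment, with both losses optimized jointly. The class name LordBertSketch, the [CLS]/mean pooling choices, the linear ordering head, and the loss weight alpha are illustrative assumptions not taken from the paper.

```python
# Sketch of a segment-ordering auxiliary task on top of BERT (assumptions noted above).
import torch
import torch.nn as nn
from transformers import BertModel


class LordBertSketch(nn.Module):
    def __init__(self, num_classes, max_segments=8, alpha=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        # "Reasoner": a light Transformer over segment-level embeddings.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_head = nn.Linear(hidden, num_classes)      # downstream classification
        self.order_head = nn.Linear(hidden, max_segments)   # predicts each segment's original position
        self.alpha = alpha                                   # weight of the ordering loss (assumed)

    def encode_segments(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, n_seg, seq_len) -> one [CLS] vector per segment
        b, n, s = input_ids.shape
        out = self.bert(input_ids.view(b * n, s), attention_mask=attention_mask.view(b * n, s))
        return out.last_hidden_state[:, 0].view(b, n, -1)

    def forward(self, input_ids, attention_mask, labels=None, orig_positions=None):
        seg_emb = self.reasoner(self.encode_segments(input_ids, attention_mask))
        cls_logits = self.cls_head(seg_emb.mean(dim=1))      # document-level prediction
        order_logits = self.order_head(seg_emb)              # (batch, n_seg, max_segments)
        if labels is None:
            return cls_logits                                # inference: downstream task only
        ce = nn.CrossEntropyLoss()
        loss = ce(cls_logits, labels) + self.alpha * ce(
            order_logits.flatten(0, 1), orig_positions.flatten())
        return loss, cls_logits
```

In this sketch, training batches would contain shuffled segments together with their original positions (orig_positions), so both losses are computed; at inference the segments are fed in their natural order and only cls_logits is used, matching the abstract's statement that the ordering task serves mainly as a training-time auxiliary objective.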
