BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin | Ming-Wei Chang | Kenton Lee | Kristina Toutanova