Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
Goro Kobayashi | Tatsuki Kuribayashi | Sho Yokoi | Kentaro Inui