Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Thamar Solorio | Bryan McCann | Gustavo Aguilar | Nazneen Rajani | Nitish Keskar | Tong Niu