Learning to Generate Representations for Novel Words: Mimic the OOV Situation in Training

In this work, we address the out-of-vocabulary (OOV) problem in sequence labeling using only the training data of the task. A typical solution is to represent an OOV word at test time by mean-pooling the representations of its surrounding words. However, such a pipeline approach often suffers from error propagation, since the supervised model is trained independently of the mean-pooling operation. We propose a novel training strategy that addresses this problem: it mimics the OOV situation during model training, training the supervised model to fit the OOV word representations generated by the mean-pooling operation. Extensive experiments on different sequence labeling tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and chunking, verify the effectiveness of the proposed method.
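A minimal sketch of both ingredients is given below, assuming a PyTorch setup. The function names (mean_pool_context, mimic_oov), the fixed context window width, and the Bernoulli masking rate p are illustrative assumptions; the abstract does not specify how in-vocabulary words are selected to be treated as OOV during training.

```python
import torch

def mean_pool_context(embeddings: torch.Tensor, i: int, window: int = 2) -> torch.Tensor:
    """Test-time OOV handling: represent the word at position i by the
    mean of its neighbours' embeddings within a fixed window (assumed width)."""
    lo, hi = max(0, i - window), min(embeddings.size(0), i + window + 1)
    idx = [j for j in range(lo, hi) if j != i]
    if not idx:  # degenerate one-word sequence: fall back to the word itself
        return embeddings[i]
    return embeddings[idx].mean(dim=0)

def mimic_oov(embeddings: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Training-time strategy: randomly treat in-vocabulary words as OOV and
    substitute the mean-pooled context representation, so the tagger is
    trained to fit the same representations it will see at test time.
    The masking rate p is an assumed hyperparameter."""
    out = embeddings.clone()
    drop = torch.rand(embeddings.size(0)) < p
    for i in torch.nonzero(drop).flatten().tolist():
        # Pool over the ORIGINAL neighbour embeddings, not already-replaced ones.
        out[i] = mean_pool_context(embeddings, i)
    return out

# Example: a sentence of 6 words with 50-dimensional embeddings.
sent = torch.randn(6, 50)
train_input = mimic_oov(sent, p=0.3)  # feed this to the tagger during training
```

In a training loop, mimic_oov would be applied to the embedded input before it is fed to the tagger, so the supervised loss is computed against the same mean-pooled stand-ins the model encounters for genuinely unseen words at test time.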
