Learning to Capitalize with Character-Level Recurrent Neural Networks: An Empirical Study

In this paper, we investigate case restoration for text without case information. Prior work on this task operates at the word level. We propose an approach using character-level recurrent neural networks (RNNs), which performs competitively with language-modeling and conditional random field (CRF) approaches. We further provide quantitative and qualitative analysis of how the RNN improves truecasing.
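
The abstract frames truecasing as character-level sequence labeling: read the lowercased character stream and decide, for each character, whether it should be uppercased. Below is a minimal PyTorch sketch of that general idea; the bidirectional LSTM, layer sizes, and all names are illustrative assumptions, not the paper's exact configuration.

    # Illustrative sketch of a character-level RNN truecaser.
    # The BiLSTM choice and dimensions are assumptions, not the paper's setup.
    import torch
    import torch.nn as nn

    class CharTruecaser(nn.Module):
        def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # A bidirectional LSTM gives each character both left and
            # right context before the per-character case decision.
            self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
            # Binary label per character: keep lowercase (0) or uppercase (1).
            self.out = nn.Linear(2 * hidden_dim, 2)

        def forward(self, char_ids):            # char_ids: (batch, seq_len)
            h, _ = self.rnn(self.embed(char_ids))
            return self.out(h)                  # logits: (batch, seq_len, 2)

    # Toy usage: map characters to ids, train with per-character
    # cross-entropy against the original cased text.
    chars = "abcdefghijklmnopqrstuvwxyz .,"
    char2id = {c: i + 1 for i, c in enumerate(chars)}  # 0 = padding/unknown

    def encode(text):
        return torch.tensor([[char2id.get(c, 0) for c in text.lower()]])

    model = CharTruecaser(vocab_size=len(chars) + 1)
    cased = "John lives in New York."
    labels = torch.tensor([[1 if c.isupper() else 0 for c in cased]])
    logits = model(encode(cased))
    loss = nn.CrossEntropyLoss()(logits.view(-1, 2), labels.view(-1))
    loss.backward()  # one step of a standard training loop

At inference time, one would uppercase exactly the characters whose predicted label is 1; operating on characters rather than words lets the model handle mixed-case tokens and out-of-vocabulary words directly.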
