Truecasing German user-generated conversational text

Truecasing, the task of restoring proper case to (generally) lowercase input, is important for downstream tasks and for screen display. In this paper, we investigate truecasing as an intrinsic task and present several experiments on noisy user queries to a voice-controlled dialog system. In particular, we compare a rule-based approach, an n-gram language model (LM) approach, and a recurrent neural network (RNN) approach, evaluating the results on a German Q&A corpus and reporting accuracy for different case categories. We show that while RNNs reach higher accuracy, especially on large datasets, character n-gram models with interpolation are still competitive, in particular on mixed-case words, where their fall-back mechanisms come into play.
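To make the evaluation setup concrete, the following is a minimal sketch of truecasing scored per case category. It is not one of the systems compared in the paper: it uses a simple word-level most-frequent-casing lookup as a stand-in, and the four category names (`lower`, `capitalized`, `upper`, `mixed`), the function names, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def case_category(token: str) -> str:
    """Assign a token to one of four illustrative case categories (assumed names)."""
    if token.islower() or not any(ch.isalpha() for ch in token):
        return "lower"          # e.g. "das", or tokens without letters
    if token.isupper():
        return "upper"          # e.g. "WLAN"
    if token[0].isupper() and token[1:].islower():
        return "capitalized"    # e.g. "Wetter"
    return "mixed"              # e.g. "iPhone"


def train_lookup(cased_sentences):
    """Map each lowercased token to its most frequent cased form in the training text."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def truecase(lowercased: str, lookup) -> str:
    """Restore case token by token; unseen tokens are left unchanged."""
    return " ".join(lookup.get(tok, tok) for tok in lowercased.split())


def accuracy_by_category(predicted, reference):
    """Token-level accuracy, grouped by the case category of the reference token."""
    correct, total = Counter(), Counter()
    for pred_sent, ref_sent in zip(predicted, reference):
        for pred, ref in zip(pred_sent.split(), ref_sent.split()):
            cat = case_category(ref)
            total[cat] += 1
            correct[cat] += pred == ref
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy data, invented for this sketch (not taken from the paper's corpus).
train = [
    "wie wird das Wetter in Berlin",
    "spiele Musik auf dem iPhone",
    "was ist WLAN",
]
reference = ["das Wetter in Berlin", "Musik auf dem iPhone"]

lookup = train_lookup(train)
predicted = [truecase(sentence.lower(), lookup) for sentence in reference]
print(accuracy_by_category(predicted, reference))
```

A lookup of this kind has no fall-back for unseen words, which is one reason character-level models can be attractive for mixed-case tokens.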
