Paper information - Truecasing German user-generated conversational text

Truecasing German user-generated conversational text

Truecasing, the task of restoring proper case to (generally) lower-case input, is important in downstream tasks and for screen display. In this paper, we investigate truecasing as an intrinsic task and present several experiments on noisy user queries to a voice-controlled dialog system. In particular, we compare a rule-based approach, an n-gram language model (LM), and a recurrent neural network (RNN), evaluating the results on a German Q&A corpus and reporting accuracy for different case categories. We show that while RNNs reach higher accuracy, especially on large datasets, character n-gram models with interpolation are still competitive, in particular on mixed-case words, where their fall-back mechanisms come into play.
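To make the task and the per-category evaluation concrete, here is a minimal sketch of a most-frequent-form baseline together with accuracy broken down by case category. It is not one of the rule-based, n-gram, or RNN systems compared in the paper; the function names, the four category labels, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def case_category(token: str) -> str:
    """Assign a token to a coarse case category (labels are illustrative)."""
    if token.islower() or not any(c.isalpha() for c in token):
        return "lower"
    if token.isupper():
        return "upper"
    if token[0].isupper() and token[1:].islower():
        return "title"
    return "mixed"  # e.g. "iPhone", "E-Mail"


def train(sentences):
    """Map each lower-cased token to its most frequent cased form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def truecase(text, model):
    """Restore case token by token; unseen tokens stay lower-case."""
    return " ".join(model.get(tok, tok) for tok in text.lower().split())


def accuracy_by_category(reference, predicted):
    """Token accuracy, grouped by the case category of the reference token."""
    correct, total = Counter(), Counter()
    for ref, hyp in zip(reference.split(), predicted.split()):
        cat = case_category(ref)
        total[cat] += 1
        correct[cat] += ref == hyp
    return {cat: correct[cat] / total[cat] for cat in total}


# Hypothetical toy corpus of correctly cased German queries.
corpus = [
    "wie wird das Wetter in Berlin",
    "spiele Musik von ABBA auf meinem iPhone",
    "das Wetter in Hamburg",
]
model = train(corpus)
print(truecase("wie wird das wetter in berlin", model))
# prints: wie wird das Wetter in Berlin
```

A lookup like this has no fall-back for unseen words, which is the gap that character-level models are meant to address.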

[1] Björn Hoffmeister, et al. Neural Text Normalization with Subword Units, 2019, NAACL.

[2] Wei Lu, et al. Learning to Capitalize with Character-Level Recurrent Neural Networks: An Empirical Study, 2016, EMNLP.

[3] Yaser Al-Onaizan, et al. Robustness to Capitalization Errors in Named Entity Recognition, 2019, EMNLP.

[4] Daniel Marcu, et al. Capitalizing Machine Translation, 2006, NAACL.

[5] Brian Roark, et al. The OpenGrm open-source finite-state grammar software libraries, 2012, ACL.

[6] Brian Roark, et al. Neural Models of Text Normalization for Speech Applications, 2019, Computational Linguistics.

[7] Lucian Vlad Lita, et al. tRuEcasIng, 2003, ACL.

[8] Silviu Cucerzan. Does Capitalization Matter in Web Search?, 2010, KDIR.

[9] Thomas Eckart, et al. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages, 2012, LREC.

[10] Hermann Ney, et al. On structuring probabilistic dependences in stochastic language modelling, 1994, Comput. Speech Lang.

[11] Fernando Batista, et al. Language Dynamics and Capitalization using Maximum Entropy, 2008, ACL.

[12] Thierry Etchegoyhen, et al. To Case or not to case: Evaluating Casing Methods for Neural Machine Translation, 2020, LREC.

[13] Alex Acero, et al. Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot, 2006, Comput. Speech Lang.

[14] Navdeep Jaitly, et al. RNN Approaches to Text Normalization: A Challenge, 2016, ArXiv.

[15] Christopher D. Manning, et al. Introduction to Information Retrieval, 2010, J. Assoc. Inf. Sci. Technol.