Lightly supervised learning of text normalization: Russian number names

Most areas of natural language processing today make heavy use of automatic inference from large corpora. One exception is text normalization for applications such as text-to-speech synthesis, where it is still the norm to build grammars by hand for tasks such as handling abbreviations or expanding digit sequences into number names. One reason for this, apart from a general lack of interest in text normalization, has been the lack of annotated data. For many languages, however, there is abundant unannotated data that can be brought to bear on these problems. This paper reports on the inference of number-name expansion in Russian, a particularly difficult language owing to its complex inflectional system. A database of several million spelled-out number names was collected from the web and mapped to digit strings using an overgenerating number-name grammar. The same overgenerating grammar can be used to produce candidate number-name expansions, which are then scored with a language model trained on the web data. Our results suggest that it is possible to infer expansion modules for very complex number-name systems from unannotated data, using only a minimum of hand-compiled seed data.
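The overgenerate-and-score pipeline described above can be illustrated with a minimal sketch. The grammar entries, the tiny "web corpus", and the unigram scoring model below are all hypothetical stand-ins, not the paper's actual grammar or language model: a real system would use a weighted finite-state grammar over full Russian morphology and a far larger web-derived model.

```python
import itertools
import math
from collections import Counter

# Hypothetical overgenerating grammar: each decimal component maps to
# several candidate inflected Russian word forms (illustrative subset only).
GRAMMAR = {
    "100": ["сто", "ста"],
    "20":  ["двадцать", "двадцати"],
    "2":   ["два", "две", "двух"],
}

def candidates(parts):
    """Overgenerate expansions as the cross-product of word-form choices."""
    options = [GRAMMAR[p] for p in parts]
    return [" ".join(combo) for combo in itertools.product(*options)]

# Toy stand-in for the web-collected spelled-out number names.
corpus = "сто двадцать два сто двадцать две двадцать два".split()
counts = Counter(corpus)
total = sum(counts.values())

def score(expansion):
    """Log-probability under an add-one-smoothed unigram model."""
    vocab = len(counts) + 1
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in expansion.split())

# Decompose 122 as 100 + 20 + 2, overgenerate, and keep the best expansion.
cands = candidates(["100", "20", "2"])
best = max(cands, key=score)
print(best)  # the corpus favors the nominative forms: "сто двадцать два"
```

The grammar deliberately overgenerates (including case forms that are wrong in this context), and the corpus-trained model does the disambiguation, which mirrors the division of labor in the approach the abstract describes.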