Lightly supervised learning of text normalization: Russian number names

Most areas of natural language processing today make heavy use of automatic inference from large corpora. One exception is text normalization for applications such as text-to-speech synthesis, where it is still the norm to build grammars by hand for tasks such as handling abbreviations or expanding digit sequences into number names. One reason for this, apart from a general lack of interest in text normalization, has been the lack of annotated data. For many languages, however, there is abundant unannotated data that can be brought to bear on these problems. This paper reports on the inference of number-name expansion in Russian, a particularly difficult language owing to its complex inflectional system. A database of several million spelled-out number names was collected from the web and mapped to digit strings using an overgenerating number-name grammar. The same overgenerating grammar can be used to produce candidate number-name expansions, which are then scored with a language model trained on the web data. Our results suggest that it is possible to infer expansion modules for very complex number-name systems from unannotated data, using only a minimum of hand-compiled seed data.
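The overgenerate-and-score pipeline described above can be illustrated with a minimal sketch. The grammar entries, the tiny "web corpus", and the unigram scoring model below are all hypothetical stand-ins, not the paper's actual grammar or language model: a real system would use a weighted finite-state grammar over full Russian morphology and a far larger web-derived model.

```python
import itertools
import math
from collections import Counter

# Hypothetical overgenerating grammar: each decimal component maps to
# several candidate inflected Russian word forms (illustrative subset only).
GRAMMAR = {
    "100": ["сто", "ста"],
    "20":  ["двадцать", "двадцати"],
    "2":   ["два", "две", "двух"],
}

def candidates(parts):
    """Overgenerate expansions as the cross-product of word-form choices."""
    options = [GRAMMAR[p] for p in parts]
    return [" ".join(combo) for combo in itertools.product(*options)]

# Toy stand-in for the web-collected spelled-out number names.
corpus = "сто двадцать два сто двадцать две двадцать два".split()
counts = Counter(corpus)
total = sum(counts.values())

def score(expansion):
    """Log-probability under an add-one-smoothed unigram model."""
    vocab = len(counts) + 1
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in expansion.split())

# Decompose 122 as 100 + 20 + 2, overgenerate, and keep the best expansion.
cands = candidates(["100", "20", "2"])
best = max(cands, key=score)
print(best)  # the corpus favors the nominative forms: "сто двадцать два"
```

The grammar deliberately overgenerates (including case forms that are wrong in this context), and the corpus-trained model does the disambiguation, which mirrors the division of labor in the approach the abstract describes.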