Language model capitalization

In many speech recognition systems, capitalization is not an inherent component of the language model: training corpora are lowercased, and counts are accumulated over sequences of lowercase words. This level of modeling is sufficient for automating voice commands or otherwise enabling users to communicate with a machine, but when the recognized speech is intended to be read by a person, as in email dictation or even some web search applications, the lack of capitalization in the transcribed text can add an extra cognitive load on the reader. For these cases, speech recognition systems often post-process the recognized text to restore capitalization. We propose folding capitalization directly into the recognition language model. Instead of post-processing, we take the approach that language should be represented in all its richness, with capitalization, diacritics, and other special symbols. From that perspective, we describe a strategy for handling poorly capitalized or uncapitalized training corpora for language modeling. The resulting recognition system retains the accuracy/latency/memory tradeoff of our uncapitalized production recognizer, while providing properly cased output.
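As a rough illustration of the general idea, and not of the specific strategy developed in this paper, the sketch below recases a lowercased training corpus using the most frequent surface form of each word observed in a well-capitalized reference corpus; the recased text could then feed standard n-gram count accumulation for a cased language model. All function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

def build_case_table(well_cased_lines):
    """Count surface forms per lowercased word in a well-capitalized corpus."""
    forms = defaultdict(Counter)
    for line in well_cased_lines:
        for token in line.split():
            forms[token.lower()][token] += 1
    # Keep the most frequent cased form for each lowercased word.
    return {word: counts.most_common(1)[0][0] for word, counts in forms.items()}

def recase_corpus(lowercased_lines, case_table):
    """Recase a poorly capitalized corpus so it can train a cased n-gram LM."""
    for line in lowercased_lines:
        tokens = [case_table.get(tok, tok) for tok in line.split()]
        yield " ".join(tokens)

# Illustrative usage with toy data.
reference = ["Send email to Barack Obama", "The White House is in Washington"]
uncased = ["send email to barack obama about the white house"]
table = build_case_table(reference)
for cased in recase_corpus(uncased, table):
    print(cased)  # -> "Send email to Barack Obama about The White House"
```

A unigram table like this ignores context and is easily skewed by sentence-initial capitalization; it is only meant to convey how uncapitalized training data can be mapped back to cased word sequences before language model estimation.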