Out-of-vocabulary rate reduction through dispersion-based lexicon acquisition

In this paper, we address the effective reduction of out-of-vocabulary (OOV) words in automatic speech recognition (ASR) systems. We first evaluate the OOV rates produced by different vocabulary sets selected from a corpus of British English according to raw frequency of occurrence. We demonstrate that OOV rates in realistic input from unlimited domains are much higher than has been reported in the literature for ASR systems, which typically deal with only a subset of the English language. To reduce OOV rates, we then propose that the textual dispersion of word types is a more effective selection criterion for the acquisition of lexicons than the conventional method of lexical selection according to raw frequency of occurrence. We evaluate two criteria: the frequency adjusted according to the index of dispersion, and the dispersion of word types among the component text categories of the training corpus. With an 80,000-word vocabulary, the estimated frequency per million words adjusted according to the index of dispersion achieves an improvement of 7.3 per cent over the frequency-based approach for a large set of testing material from a variety of sources. Vocabulary sets selected according to textual dispersion alone achieve a slightly better overall OOV reduction rate of 7.5 per cent.
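
To make the two selection criteria concrete, the following minimal Python sketch contrasts frequency-based and dispersion-based vocabulary selection on toy data. It is an illustration under stated assumptions, not the procedure used in the paper: dispersion is approximated here as the number of text categories in which a word type occurs, the index of dispersion itself is not reproduced, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def oov_rate(vocabulary, test_tokens):
    """Fraction of test tokens that are not covered by the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


def select_by_frequency(categories, size):
    """Pick the `size` word types with the highest raw corpus frequency."""
    freq = Counter()
    for tokens in categories.values():
        freq.update(tokens)
    # Ties are broken alphabetically so the selection is deterministic.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return set(ranked[:size])


def select_by_dispersion(categories, size):
    """Pick the `size` word types found in the most text categories.

    Assumption: dispersion is simplified to a category count; raw
    frequency is used only to break ties between equally spread words.
    """
    freq = Counter()
    spread = Counter()
    for tokens in categories.values():
        freq.update(tokens)
        spread.update(set(tokens))  # count each category at most once
    ranked = sorted(freq, key=lambda w: (-spread[w], -freq[w], w))
    return set(ranked[:size])


# Hypothetical toy training corpus, split into text categories.
categories = {
    "news": ["the", "market", "fell", "market", "market"],
    "fiction": ["the", "door", "opened"],
    "science": ["the", "door", "cell"],
}
# Hypothetical unseen test material.
test_tokens = ["the", "door", "closed", "the"]

freq_vocab = select_by_frequency(categories, 2)
disp_vocab = select_by_dispersion(categories, 2)
print(oov_rate(freq_vocab, test_tokens))
print(oov_rate(disp_vocab, test_tokens))
```

In this toy corpus, "market" is frequent but confined to one category, so the frequency-based vocabulary admits it at the expense of "door", which is rarer overall but spread across two categories and therefore more likely to appear in unseen text.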