Statistical language modeling with prosodic boundaries and its use for continuous speech recognition

A new statistical language modeling method was proposed in which word n-grams are counted separately for the cases of crossing and not crossing accent phrase boundaries. Since such counting requires a large speech corpus, which is difficult to prepare, part-of-speech (POS) n-grams were first counted for the two cases on a small speech corpus, and the result was then applied to the word n-gram counts of a large text corpus to divide them accordingly. In this way, two types of word n-gram model were obtained. Using the ATR continuous speech corpus of two speakers, the perplexity reduction from the baseline model to the proposed model was calculated for word bigrams. When the accent phrase boundary information of the speech corpus was used, the reduction reached 11%, and when boundaries were extracted using our previously developed method based on mora-F0 transition modeling, it still exceeded 8%. A reduction of around 5% was still observed for sentences not included in the calculation of the POS bigrams, with boundaries automatically extracted from another speaker’s speech. The obtained bigram models were applied to continuous speech recognition, resulting in a two-percent improvement in word accuracy over the baseline model.
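
The count-division step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each word bigram count from the text corpus is split in proportion to how often its POS bigram crossed an accent phrase boundary in the speech corpus; each word carries a single POS tag; and POS pairs unseen in the speech corpus are split evenly. All names (`boundary_ratio`, `split_word_bigrams`) and all toy counts are hypothetical.

```python
def boundary_ratio(pos_cross, pos_within, pos_pair, default=0.5):
    """Fraction of occurrences of a POS bigram that cross a boundary.

    pos_cross / pos_within: POS bigram counts from the small speech corpus.
    default: fallback for unseen POS pairs (an assumption of this sketch).
    """
    cross = pos_cross.get(pos_pair, 0)
    within = pos_within.get(pos_pair, 0)
    total = cross + within
    return cross / total if total else default


def split_word_bigrams(word_bigrams, pos_of, pos_cross, pos_within):
    """Divide text-corpus word bigram counts into two count tables."""
    cross_counts, within_counts = {}, {}
    for (w1, w2), count in word_bigrams.items():
        ratio = boundary_ratio(pos_cross, pos_within, (pos_of[w1], pos_of[w2]))
        cross_counts[(w1, w2)] = count * ratio
        # Subtract so the two parts always sum to the original count.
        within_counts[(w1, w2)] = count - cross_counts[(w1, w2)]
    return cross_counts, within_counts


# Toy data, invented for illustration only.
pos_cross = {("PARTICLE", "NOUN"): 8, ("NOUN", "PARTICLE"): 1}
pos_within = {("PARTICLE", "NOUN"): 2, ("NOUN", "PARTICLE"): 9}
pos_of = {"wa": "PARTICLE", "tenki": "NOUN", "ga": "PARTICLE"}
word_bigrams = {("wa", "tenki"): 10, ("tenki", "ga"): 20}

cross_counts, within_counts = split_word_bigrams(
    word_bigrams, pos_of, pos_cross, pos_within
)
for bigram in word_bigrams:
    print(bigram, cross_counts[bigram], within_counts[bigram])
```

In this toy data, a particle followed by a noun mostly crosses a boundary, so most of the count for ("wa", "tenki") goes to the boundary-crossing table, while a noun followed by a particle mostly stays within an accent phrase. Each table could then be normalized and smoothed into its own bigram model.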