Generating search query in unsupervised language model adaptation using WWW

To improve the accuracy of an LVCSR (large vocabulary continuous speech recognition) system, it is effective to gather text data related to the topic of the input speech and adapt the language model using that text. Several systems have been developed that gather text data from the World Wide Web using keywords specified by a user. Those systems require the user to be involved in the transcription process, whereas it is desirable to automate the entire process. To automate the text collection, we propose a method that creates an adapted language model by collecting topic-related text from the World Wide Web. The proposed method composes the search query from the first recognition result, gathers text data from the WWW, and adapts the language model. The input speech is then decoded again using the adapted language model. Because the first recognition result contains recognition errors, we developed a method to exclude misrecognized words using word-based confidence scores and similarities between keywords. ...
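
The query-composition step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compose_query`, the thresholds, the averaging rule, and the toy similarity function are all assumptions made for illustration. The sketch only shows the general idea of discarding first-pass words that have a low confidence score or that are dissimilar to the other candidate keywords.

```python
def compose_query(words, confidence, similarity,
                  conf_threshold=0.7, sim_threshold=0.2, max_terms=3):
    """Build a search query from first-pass recognition output.

    words:      candidate keywords from the first recognition result
    confidence: dict mapping word -> confidence score in [0, 1]
    similarity: function (w1, w2) -> float; a hypothetical keyword
                similarity measure, not the one used in the paper
    All thresholds are illustrative assumptions.
    """
    # Step 1: remove duplicates (keeping order) and low-confidence words.
    confident = [w for w in dict.fromkeys(words)
                 if confidence.get(w, 0.0) >= conf_threshold]

    # Step 2: drop words whose mean similarity to the other candidates is
    # low; such words are likely misrecognitions or off-topic.
    kept = []
    for w in confident:
        others = [o for o in confident if o != w]
        if not others:
            kept.append(w)
            continue
        mean_sim = sum(similarity(w, o) for o in others) / len(others)
        if mean_sim >= sim_threshold:
            kept.append(w)

    # Step 3: keep the most confident remaining words as the query.
    kept.sort(key=lambda w: confidence[w], reverse=True)
    return " ".join(kept[:max_terms])


# Toy data: "banana" is a confident but off-topic misrecognition,
# and "vote" has a low confidence score.
POLITICS = {"election", "parliament", "vote", "minister"}

def toy_similarity(a, b):
    # Assumed stand-in for a real similarity measure.
    return 1.0 if a in POLITICS and b in POLITICS else 0.0

words = ["election", "parliament", "banana", "vote", "minister"]
confidence = {"election": 0.9, "parliament": 0.8, "banana": 0.85,
              "vote": 0.4, "minister": 0.75}

query = compose_query(words, confidence, toy_similarity)
print(query)
```

In a complete system, the resulting query would be sent to a web search engine, the retrieved text would be used to adapt the language model, and the input speech would be decoded a second time.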