Unsupervised acoustic model training

This paper describes some recent experiments using unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated raw broadcast news data. The hypothesized transcription is used to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 minutes of manually annotated data. These experiments demonstrate that unsupervised training is a viable training scheme and can dramatically reduce the cost of building acoustic models.