In-Document Adaptation for a Human Guided Automatic Transcription Service

This work addresses the task of assisting human transcribers in producing transcriptions of, for example, interviews or parliamentary speeches. The system performs in-document adaptation based on a small amount of manually corrected automatic speech recognition output. The corrected segments of the spoken document are used to adapt the speech recognizer’s acoustic and language models. The updated models are then used in a second recognition pass to produce a more accurate automatic transcription of the remaining, uncorrected parts of the spoken document. We evaluate two common adaptation methods for speech data in settings that represent typical transcription tasks: Maximum A Posteriori (MAP) adaptation for the acoustic model and linear interpolation for the language model. We compare supervised adaptation with unsupervised adaptation, and evaluate the total benefit of using human-corrected segments for in-document adaptation in typical transcription tasks.
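As a rough illustration of the two adaptation methods named above, the following Python sketch shows their textbook forms. It is not the system evaluated in this work: the abstract gives no formulas, so the unigram dictionaries, the interpolation weight `lam`, the prior weight `tau`, and both function names are assumptions made for illustration. Linear interpolation mixes a background language model with one estimated from the corrected segments, and the Maximum A Posteriori (MAP) update shown here re-estimates only a single Gaussian mean.

```python
def interpolate_lm(p_background, p_adapt, lam):
    """Linearly interpolate two word-probability distributions.

    lam is the weight of the adaptation model (hypothetical parameter);
    words missing from one model are treated as having probability 0 there.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_background) | set(p_adapt)
    return {
        w: lam * p_adapt.get(w, 0.0) + (1.0 - lam) * p_background.get(w, 0.0)
        for w in vocab
    }


def map_update_mean(prior_mean, frames, posteriors, tau):
    """Standard MAP re-estimate of one Gaussian mean.

    prior_mean: mean from the unadapted model
    frames:     adaptation feature vectors (lists of floats)
    posteriors: occupation probability of this Gaussian for each frame
    tau:        prior weight; larger values keep the mean nearer the prior
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted_sum = [
        sum(g * x[d] for g, x in zip(posteriors, frames)) for d in range(dim)
    ]
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
        for d in range(dim)
    ]
```

With little adaptation data the occupancy is small and the MAP mean stays close to the prior; as more corrected segments accumulate, it moves toward the mean of the new data. The interpolation weight plays a similar role on the language model side.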
