Multi-Staged Cross-Lingual Acoustic Model Adaption for Robust Speech Recognition in Real-World Applications - A Case Study on German Oral History Interviews

While recent automatic speech recognition systems achieve remarkable performance when large amounts of adequate, high-quality annotated speech data are available for training, the same systems often perform poorly on tasks in domains that deviate substantially from the conditions represented by the training data. For many real-world applications, there is a lack of sufficient data that can be used directly to train robust speech recognition systems. To address this issue, we propose and investigate an approach that performs robust acoustic model adaptation to a target domain in a cross-lingual, multi-staged manner. Our approach enables the exploitation of large-scale training data from other domains, in both the same and other languages. We evaluate our approach on the challenging task of German oral history interviews, where we achieve a relative word error rate reduction of more than 30% compared to a model trained from scratch only on the target domain, and of 6-7% relative compared to a model trained robustly on 1000 hours of same-language out-of-domain training data.
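To make the staged adaptation idea concrete, the sketch below is a minimal, hypothetical PyTorch analogue of the pipeline, not the paper's actual Kaldi-based setup: a shared encoder is transferred from stage to stage, while the output layer is re-initialized whenever the target inventory changes (e.g., when switching from English to German targets). All names here (AcousticModel, transfer, train_stage, the corpus placeholders and target counts) are illustrative assumptions.

```python
# Illustrative sketch of multi-staged cross-lingual acoustic model adaptation.
# NOT the paper's Kaldi/LF-MMI implementation; a hypothetical PyTorch analogue.
import torch
import torch.nn as nn


class AcousticModel(nn.Module):
    """Toy acoustic model: a shared encoder plus a task-specific output layer."""

    def __init__(self, n_feats: int, n_targets: int, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.output = nn.Linear(hidden, n_targets)

    def forward(self, x):
        return self.output(self.encoder(x))


def transfer(src: AcousticModel, n_new_targets: int) -> AcousticModel:
    """Copy the encoder from the previous stage; re-initialize the output
    layer for a new target inventory (e.g., a different language)."""
    dst = AcousticModel(src.encoder[0].in_features, n_new_targets)
    dst.encoder.load_state_dict(src.encoder.state_dict())
    return dst


def train_stage(model: AcousticModel, dataset, lr: float) -> AcousticModel:
    """One adaptation stage: plain supervised training on `dataset`."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for feats, labels in dataset:  # placeholder training loop
        opt.zero_grad()
        loss = criterion(model(feats), labels)
        loss.backward()
        opt.step()
    return model


# Placeholder corpora and target counts (hypothetical values).
english_data, german_broadcast_data, oral_history_data = [], [], []
ENGLISH_TARGETS, GERMAN_TARGETS = 5000, 4500

# Stage 1: train on large-scale English out-of-domain data.
english_am = train_stage(AcousticModel(40, ENGLISH_TARGETS), english_data, lr=1e-3)
# Stage 2: cross-lingual transfer to German broadcast data (new output layer).
german_am = train_stage(transfer(english_am, GERMAN_TARGETS), german_broadcast_data, lr=1e-3)
# Stage 3: in-language, in-domain adaptation on the small oral history corpus.
adapted_am = train_stage(german_am, oral_history_data, lr=1e-4)
```

The reduced learning rate in the final stage reflects a common transfer-learning heuristic: it helps prevent the small in-domain corpus from overwriting the robust representations learned on the large out-of-domain corpora in the earlier stages.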
