Recognition of Latin American Spanish Using Multi-Task Learning

In the broadcast news domain, nationwide newscasters typically interact with communities that have a diverse set of accents. One of the challenges in speech recognition is performance degradation in the presence of these diverse conditions, and performance degrades further when the accents come from other countries that share the same language. Extensive work has been conducted on this topic for languages such as English and Mandarin. Recently, TDNN-based multi-task learning has received some attention in this area, with interesting results, typically using models trained on a variety of accented corpora from a single language. In this work, we look at the case of LATAM (Latin American) Spanish for its unique and distinctive accent variations. Because LATAM Spanish has historically been influenced by non-Spanish European migrations, we anticipated that LATAM-based speech recognition performance could be further improved by including these influential languages during TDNN-based multi-task training. Experiments show that including such languages in the training setup outperforms the single-task acoustic model baseline. We also propose an automatic per-language weight selection strategy to regularize each language's contribution during multi-task training.
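The core idea of weighting each language's contribution to a multi-task objective can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual method: the inverse-loss normalization used to pick the weights is an assumption chosen for illustration, and the function name and language identifiers are hypothetical.

```python
def weighted_multitask_loss(per_language_losses, weights=None):
    """Combine per-language task losses into a single training objective.

    per_language_losses: dict mapping language id -> scalar loss.
    weights: optional dict of per-language weights. If None, each language
             receives a weight proportional to the inverse of its current
             loss, normalized to sum to 1 (an illustrative assumption, so
             that no single language dominates training).
    Returns the combined loss and the weights actually used.
    """
    langs = sorted(per_language_losses)
    if weights is None:
        # Inverse-loss weighting: languages with higher loss get less weight.
        inv = {lang: 1.0 / per_language_losses[lang] for lang in langs}
        total = sum(inv.values())
        weights = {lang: inv[lang] / total for lang in langs}
    combined = sum(weights[lang] * per_language_losses[lang] for lang in langs)
    return combined, weights


# Example: a LATAM Spanish task plus two hypothetical auxiliary languages.
loss, w = weighted_multitask_loss({"es-LA": 2.0, "it": 4.0, "pt": 4.0})
```

In this sketch, the easier (lower-loss) language receives a larger weight, which is one plausible way to regularize each task's contribution; the paper's automatic selection strategy may differ.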
