BUT OpenSAT 2017 Speech Recognition System

The paper describes BUT Automatic Speech Recognition (ASR) systems for two domains in the OpenSAT evaluations: Low Resourced Languages and Public Safety Communications. The first domain was challenging due to a lack of training data; therefore, multilingual approaches to BLSTM training were employed, and the recently published Residual Memory Networks, which require less training data, were used. The combination of both approaches led to superior performance. The second domain was challenging due to recording in extreme conditions: a specific channel, speakers under stress, and high levels of noise. A data augmentation process was essential to obtaining reasonably good performance.
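One common form of the data augmentation the abstract refers to is mixing recorded noise into clean speech at a controlled signal-to-noise ratio. The sketch below is a minimal, hypothetical illustration of that idea; the helper name `mix_at_snr` and the synthetic signals are assumptions for illustration, not the paper's actual pipeline (which may also include channel simulation, e.g. TETRA-style distortion).

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix noise into speech at a target SNR in dB.

    Hypothetical augmentation helper: scales the noise so that
    10*log10(P_speech / P_noise_scaled) equals snr_db.
    """
    # Loop/trim the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Average powers of the two signals.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale factor that yields the requested SNR after mixing.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: 1 s of a synthetic 440 Hz tone at 16 kHz, noise mixed at 10 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

In practice, augmented copies of the training set are generated at several SNR levels (and with several noise types) so the acoustic model sees the range of conditions expected at test time.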
