Analysis of keyword spotting performance across IARPA Babel languages

With the completion of the IARPA Babel program, it is possible to systematically analyze the performance of speech recognition systems across a wide variety of languages. We select 16 languages from the dataset and compare performance using a deep neural network-based acoustic model, focusing on keyword spotting as measured by the actual term-weighted value (ATWV) metric. We demonstrate that ATWV is keyword dependent, and that this dependence must be accounted for in any cross-language analysis. Further, we show that while performance across languages does not track with any particular linguistic feature, it is correlated with inter-annotator agreement.
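To make the keyword dependence of the metric concrete, the following is a minimal sketch of the standard ATWV computation as defined in the NIST Spoken Term Detection evaluations (the per-keyword statistics and speech duration here are illustrative assumptions, not values from this study). Because each keyword's term-weighted value is averaged with equal weight regardless of how often the keyword occurs, a handful of rare keywords can dominate the score.

```python
BETA = 999.9  # standard NIST weight trading false alarms against misses


def atwv(keyword_stats, speech_duration_s):
    """Actual term-weighted value.

    keyword_stats: list of (n_true, n_correct, n_false_alarm) tuples,
        one per keyword, where n_true is the number of reference
        occurrences, n_correct the number of correct detections, and
        n_false_alarm the number of spurious detections.
    speech_duration_s: total duration of the evaluated speech in seconds.
    """
    twv_sum = 0.0
    for n_true, n_correct, n_fa in keyword_stats:
        p_miss = 1.0 - n_correct / n_true
        # False-alarm opportunities are approximated as one trial per
        # second of speech, minus the true occurrences of the keyword.
        n_nontarget = speech_duration_s - n_true
        p_fa = n_fa / n_nontarget
        twv_sum += 1.0 - (p_miss + BETA * p_fa)
    return twv_sum / len(keyword_stats)


# A system that finds every occurrence with no false alarms scores 1.0;
# a system that returns nothing scores 0.0. A single false alarm on a
# keyword costs roughly BETA / speech_duration_s, so rare keywords with
# few true hits pull the equal-weighted average down sharply.
perfect = atwv([(10, 10, 0), (2, 2, 0)], speech_duration_s=3600.0)
silent = atwv([(10, 0, 0), (2, 0, 0)], speech_duration_s=3600.0)
```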
