Toward High-Performance Language-Independent Query-by-Example Spoken Term Detection for MediaEval 2015: Post-Evaluation Analysis

This paper documents the significant components of a state-ofthe-art language-independent query-by-example spoken term detection system designed for the Query by Example Search on Speech Task (QUESST) in MediaEval 2015. We developed exact and partial matching DTW systems, and WFST based symbolic search systems to handle different types of search queries. To handle the noisy and reverberant speech in the task, we trained tokenizers using data augmented with different noise and reverberation conditions. Our postevaluation analysis showed that the phone boundary label provided by the improved tokenizers brings more accurate speech activity detection in DTW systems. We argue that acoustic condition mismatch is possibly a more important factor than language mismatch for obtaining consistent gain from stacked bottleneck features. Our post-evaluation system, involving a smaller number of component systems, can outperform our submitted systems, which performed the best for the task.

[1]  Hermann Ney,et al.  Cross-lingual portability of Chinese and english neural network features for French and German LVCSR , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[2]  Martin Karafiát,et al.  Hierarchical neural net architectures for feature extraction in ASR , 2010, INTERSPEECH.

[3]  Mireia Díez,et al.  High-performance Query-by-Example Spoken Term Detection on the SWS 2013 evaluation , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Bin Ma,et al.  Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection , 2016, INTERSPEECH.

[5]  Jan Cernocký,et al.  Comparison of methods for language-dependent and language-independent query-by-example spoken term detection , 2012, TOIS.

[6]  Murat Saraclar,et al.  Lattice Indexing for Spoken Term Detection , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Timothy J. Hazen,et al.  Query-by-example spoken term detection using phonetic posteriorgram templates , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[8]  Bin Ma,et al.  Acoustic Segment Modeling with Spectral Clustering Methods , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9]  Monson H. Hayes,et al.  Statistical Digital Signal Processing and Modeling , 1996 .

[10]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[11]  A. Oppenheim,et al.  Signal reconstruction from phase or magnitude , 1980 .

[12]  Bin Ma,et al.  Intrinsic spectral analysis based on temporal context features for query-by-example spoken term detection , 2014, INTERSPEECH.

[13]  Bin Ma,et al.  Approximate search of audio queries by using DTW with phone time boundary and data augmentation , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Bin Ma,et al.  Language independent query-by-example spoken term detection using N-best phone sequences and partial matching , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Chng Eng Siong,et al.  The NNI Query-by-Example System for MediaEval 2015 , 2014, MediaEval.

[16]  Jacob Benesty,et al.  New insights into the noise reduction Wiener filter , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Hamid Sheikhzadeh,et al.  ETSI AMR-2 VAD: evaluation and ultra low-resource implementation , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[18]  Jont B. Allen,et al.  Image method for efficiently simulating small‐room acoustics , 1976 .

[19]  Frédéric Bimbot,et al.  Audio keyword extraction by unsupervised word discovery , 2009, INTERSPEECH.

[20]  Lukás Burget,et al.  Copingwith channel mismatch in Query-by-Example - But QUESST 2014 , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Franciska de Jong,et al.  Robust speech/non-speech classification in heterogeneous multimedia content , 2011, Speech Commun..

[22]  Bin Ma,et al.  Learning Neural Network Representations Using Cross-Lingual Bottleneck Features with Word-Pair Information , 2016, INTERSPEECH.

[23]  James R. Glass,et al.  Unsupervised Pattern Discovery in Speech , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[24]  Jorge Proença,et al.  The SPL-IT Query by Example Search on Speech system for MediaEval 2014 , 2014, MediaEval.

[25]  Bin Ma,et al.  Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[26]  James R. Glass,et al.  Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[27]  Florian Metze,et al.  Query by Example Search on Speech at Mediaeval 2015 , 2014, MediaEval.