ALBAYZIN 2018 spoken term detection evaluation: a multi-domain international evaluation in Spanish

Search on speech (SoS) is a challenging area due to the huge amount of information stored in audio and video repositories. Spoken term detection (STD) is an SoS task that aims to retrieve data from a speech repository given a textual representation of a search term, which may consist of one or more words. This paper presents a multi-domain, internationally open evaluation for STD in Spanish. The evaluation was designed so that several analyses of the main results can be carried out. The task consists of retrieving the speech files that contain the terms and providing, for each detection, its start and end times and a score that reflects the confidence in the detection. Three Spanish speech databases covering different domains were employed: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the COREMAH database, which contains two-person spontaneous conversations on different topics. We present the evaluation itself, the three databases, the evaluation metric, the submitted systems, the results, and detailed post-evaluation analyses based on term properties (in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and native/foreign terms). Fusion results of the primary submitted systems are also presented. Three research groups took part in the evaluation, and 11 systems were submitted. The results suggest that STD remains an open problem and that performance is highly sensitive to changes in the data domain.
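
To make the expected system output concrete, the following is a minimal sketch of how a single detection (speech file, start and end times, confidence score) might be represented and filtered. The class name, field names, file identifiers, score range, and threshold are illustrative assumptions, not the submission format defined by the evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One hypothesised occurrence of a search term (hypothetical layout)."""
    term: str       # textual search term, one or more words
    file_id: str    # speech file in which the term was detected
    start: float    # start time in seconds
    end: float      # end time in seconds
    score: float    # confidence given to the detection (assumed range 0..1)


def detections_for_term(detections, term, min_score=0.0):
    """Return detections of `term` scoring at least `min_score`, best first."""
    hits = [d for d in detections if d.term == term and d.score >= min_score]
    return sorted(hits, key=lambda d: d.score, reverse=True)


# Made-up example data; identifiers and scores are not from the evaluation.
example = [
    Detection("madrid", "talk_01", 12.40, 12.95, 0.91),
    Detection("madrid", "news_07", 3.10, 3.70, 0.35),
    Detection("real madrid", "news_07", 2.60, 3.70, 0.78),
]

confident_hits = detections_for_term(example, "madrid", min_score=0.5)
```

In a layout like this, a multi-word term is simply a longer string in `term`, and the score threshold controls the trade-off between missed terms and false alarms.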
