Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion

Spoken term detection (STD) aims to retrieve data from a speech repository given a textual representation of the search term. It is currently receiving considerable interest because of the large volume of multimedia information available. STD differs from automatic speech recognition (ASR) in that ASR is concerned with all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the STD ALBAYZIN 2014 evaluation, held as part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The task consists of retrieving the speech files that contain the search terms and indicating, for each detection, its start and end times within the corresponding speech file, along with a score that reflects the confidence given to the detection. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. The results show reasonable performance for a moderate out-of-vocabulary term rate. The paper also compares the submitted systems and provides an in-depth analysis based on several search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms).
