GTM-UVigo systems for Albayzin 2014 Search on Speech Evaluation

This paper describes the systems developed by the GTM-UVigo team for the Albayzin 2014 Search on Speech evaluation. The primary system for the spoken term detection task consisted of the fusion of two large vocabulary continuous speech recognition systems that differed in almost all their components: front-end, acoustic modelling, decoder and keyword search approach. For the keyword spotting task, an isolated word recognition system was fused with these two speech recognition systems. For the query-by-example spoken term detection task, a fusion of three systems was presented. One of them followed one of the aforementioned continuous speech recognition approaches, with the difference that in this case it was necessary to obtain a transcription of the queries. The other two systems performed a dynamic time warping search; the main novelty of this approach was the use of fingerprints as feature vectors.
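
To illustrate the dynamic time warping search mentioned for the query-by-example task, the following is a minimal sketch of subsequence DTW, which looks for the segment of an utterance that best matches a spoken query. It is not the GTM-UVigo implementation: the function names, the Euclidean frame distance and the toy one-dimensional features are assumptions made for illustration, and the fingerprint features used in the paper are not reproduced here.

```python
import math


def euclidean(a, b):
    """Frame-level distance (an assumption; the paper does not specify it here)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, utterance, dist=euclidean):
    """Find the utterance segment that best matches the query.

    Returns (normalized_cost, start, end), where utterance[start:end]
    is the best-matching segment and the cost is divided by the query length.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # acc[i][j]: lowest accumulated cost of aligning query[:i]
    # with some segment of the utterance that ends at frame j - 1.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index of the first utterance frame on that best path.
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        acc[0][j] = 0.0  # the match may begin at any utterance frame at no cost
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal, vertical and horizontal predecessors.
            best_cost, best_start = min(
                (acc[i - 1][j - 1], start[i - 1][j - 1]),
                (acc[i - 1][j], start[i - 1][j]),
                (acc[i][j - 1], start[i][j - 1]),
            )
            acc[i][j] = dist(query[i - 1], utterance[j - 1]) + best_cost
            start[i][j] = best_start
    # The match may also end at any utterance frame.
    end = min(range(1, m + 1), key=lambda j: acc[n][j])
    return acc[n][end] / n, start[n][end], end


# Toy example: the query occurs verbatim in frames 2..4 of the utterance.
query = [[1.0], [2.0], [3.0]]
utterance = [[9.0], [9.0], [1.0], [2.0], [3.0], [9.0]]
cost, seg_start, seg_end = subsequence_dtw(query, utterance)
```

In a detection setting, the normalized cost would typically be converted into a score and thresholded, and scores from several systems could then be calibrated and fused; those steps are outside the scope of this sketch.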
