The Multi-Domain International Search on Speech 2020 ALBAYZIN Evaluation: Overview, Systems, Results, Discussion and Post-Evaluation Analyses

The large amount of information stored in audio and video repositories makes search on speech (SoS) a challenging area that continues to attract considerable interest. Within SoS, spoken term detection (STD) aims to retrieve speech data given a text-based representation of a search query (which may comprise one or more words), whereas query-by-example spoken term detection (QbE STD) aims to retrieve speech data given an acoustic representation of a search query. This is the first paper to present an internationally open multi-domain evaluation for SoS in Spanish that includes both STD and QbE STD tasks. The evaluation was carefully designed so that several post-evaluation analyses of the main results could be carried out. The evaluation tasks aim to retrieve the speech files that contain the queries, providing their start and end times along with a score that reflects how likely it is that the query occurs within the given time interval of the given speech file. Three Spanish speech databases spanning different domains were employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the SPARL20 database, which contains Spanish parliament sessions. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the evaluation results, and detailed post-evaluation analyses based on specific query properties (in-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). The most novel features of the submitted systems are a data augmentation technique for the STD task and an end-to-end system for the QbE STD task. The results suggest that there is clearly room for improvement in the SoS task and that performance is highly sensitive to changes in the data domain.