Finding relevant features for zero-resource query-by-example search on speech

Zero-resource query-by-example search on speech strategies have attracted the interest of the research community, as they require neither training (and therefore no large amounts of training data) nor any knowledge about the language to be processed or any other language. These systems usually rely on Mel-frequency cepstral coefficients (MFCCs) for speech representation and on dynamic time warping (DTW) or one of its variants for performing the search. Nevertheless, which features to use in this task is still an open research problem, and the use of large feature sets combined with feature selection approaches has not yet been addressed in the query-by-example search on speech scenario. In this paper, we present two methods to select the most relevant features among a large set of acoustic features; they estimate the relevance of each feature using the costs of the best alignment path (obtained when performing DTW) and of its neighbouring region. To prove the validity of these methods, experiments were carried out in four different search on speech scenarios that were used in international benchmarks, namely the Albayzin 2014 search on speech evaluation, MediaEval spoken web search SWS 2013, and MediaEval query-by-example search on speech QUESST2014 and QUESST2015. Search performance improved dramatically when the feature set was reduced using the proposed techniques, especially in the case of the relevance-based approaches. A comparison between the proposed methods and other representations, such as MFCCs, phonetic posteriorgrams and dimensionality reduction based on principal component analysis, showed that the zero-resource approaches presented in this paper are promising, as they outperformed more widely used approaches in all the experimental scenarios. Apart from improving search on speech results, the feature relevance estimation approaches also revealed features other than MFCCs that seemed to add value in query-by-example tasks.
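The idea of reading feature relevance off a DTW alignment can be illustrated with a minimal sketch. The abstract does not give the exact relevance measures, so everything below is an assumption made for illustration: the squared Euclidean local cost, the function names `dtw_best_path` and `per_feature_path_cost`, and the rule of ranking features by their mean cost along the best path for a query known to match the document. The neighbouring region of the path is not modelled here.

```python
def dtw_best_path(query, doc):
    """Classic DTW between two sequences of feature vectors.

    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    The squared Euclidean local cost is an assumption of this sketch.
    """
    n, m = len(query), len(doc)
    inf = float("inf")

    def local(i, j):
        return sum((a - b) ** 2 for a, b in zip(query[i], doc[j]))

    # Accumulated cost matrix with steps (i-1, j), (i, j-1), (i-1, j-1).
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = local(0, 0)
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = local(i, j) + best_prev

    # Backtrack from the last cell to the first to recover the best path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1][j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i][j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


def per_feature_path_cost(query, doc, path):
    """Mean squared difference of each feature along the alignment path."""
    dims = len(query[0])
    costs = [0.0] * dims
    for i, j in path:
        for d in range(dims):
            costs[d] += (query[i][d] - doc[j][d]) ** 2
    return [c / len(path) for c in costs]


# Toy data (hypothetical): feature 0 follows the same trajectory in both
# sequences, while feature 1 behaves erratically in the query.
query = [[0.0, 5.0], [1.0, -3.0], [2.0, 4.0]]
doc = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]

total_cost, path = dtw_best_path(query, doc)
feature_costs = per_feature_path_cost(query, doc, path)

# Illustrative ranking rule: lower mean path cost is treated as more relevant.
ranking = sorted(range(len(feature_costs)), key=lambda d: feature_costs[d])
print(path, feature_costs, ranking)
```

In this toy case feature 0 contributes nothing to the path cost while feature 1 accounts for all of it, so a selection step built on this rule would keep feature 0 first. A real system would aggregate such costs over many query and document pairs before discarding features.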
