Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval

This paper examines multilingual audio Query-by-Example (QbE) retrieval using the posteriorgram-based Phonetic Unit Modelling (PUM) approach and the Weighted Fast Sequential Dynamic Time Warping (WFSDTW) algorithm. The PUM approach employs phone recognizers trained in a supervised way on language-specific external resources, so information about the phonetic distribution is embedded in the acoustic modelling. The resulting acoustic models were also used for language-independent QbE retrieval. The improved WFSDTW algorithm was implemented to search for each query (keyword) within each utterance file. The main focus is the retrieval performance of the proposed WFSDTW solution, which employs posteriorgram-based keyword matching with Gaussian mixture modelling (GMM). Scores from four language-dependent sub-systems were normalized and fused using a simple max-score merging strategy. The results show an advantage of the proposed WFSDTW solution over the two other evaluated techniques, namely the basic DTW and segmental DTW algorithms. The combination of multiple PUM techniques with WFSDTW also proved to be an effective solution for the QbE task.
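As a rough illustration of the kind of pipeline described above, the following Python sketch matches a query posteriorgram against an utterance posteriorgram with basic DTW and then fuses sub-system scores by max-score merging. It is not the WFSDTW algorithm: the weighting and the fast sequential search are not specified in this abstract and are not reproduced here. The negative-log inner-product frame distance, the z-score normalization, and all function names are assumptions chosen for the example.

```python
import numpy as np


def frame_distance(query, utterance, eps=1e-10):
    """Pairwise -log inner product between posterior vectors (assumed distance)."""
    return -np.log(np.maximum(query @ utterance.T, eps))


def dtw_cost(query, utterance):
    """Length-normalized cost of the best full alignment using basic DTW."""
    d = frame_distance(query, utterance)
    n, m = d.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = d[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)


def z_normalize(scores):
    """Zero-mean, unit-variance normalization (assumed normalization scheme)."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std


def max_score_fusion(per_system_scores):
    """Keep, for each utterance, the highest normalized score of any sub-system."""
    normalized = np.stack([z_normalize(s) for s in per_system_scores])
    return normalized.max(axis=0)
```

In this sketch each posteriorgram is an array of shape (frames, classes) whose rows sum to one. A DTW cost would be negated before fusion so that a higher score means a better match, and `max_score_fusion` expects one score array per sub-system with utterances in the same order.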
