Paper Information - Multitask Feature Learning for Low-Resource Query-by-Example Spoken Term Detection

We propose a novel technique that learns a low-dimensional feature representation from unlabeled data in a target language and labeled data from a nontarget language. The technique is studied as a solution to query-by-example spoken term detection (QbE-STD) for a low-resource language. We extract low-dimensional features from a bottleneck layer of a multitask deep neural network, which is jointly trained with speech data from the low-resource target language and a resource-rich nontarget language. The proposed feature learning technique aims to extract acoustic features that offer phonetic discriminability. It explores a new way of leveraging cross-lingual speech data to overcome the resource limitation in the target language. We conduct QbE-STD experiments using the dynamic time warping distance of the multitask bottleneck features between the query and the search database. The QbE-STD process does not rely on an automatic speech recognition pipeline for the target language. We validate the effectiveness of multitask feature learning through a series of comparative experiments.
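
To make the matching step concrete, the following is a minimal sketch of subsequence dynamic time warping between a query and one utterance from the search database, each represented as a sequence of frame-level feature vectors (for example, bottleneck features). The abstract does not specify the local distance, the step pattern, or the score normalization, so the cosine distance, the three standard DTW steps, and the division by query length used here are assumptions, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two feature vectors (assumed local distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero frame as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def subsequence_dtw(query, utterance):
    """Find the best match of `query` anywhere inside `utterance`.

    Both arguments are lists of equal-dimension feature vectors.
    Returns (cost, end_frame): the accumulated distance of the best
    alignment divided by the query length (an assumed normalization),
    and the index of the utterance frame where that alignment ends.
    """
    n, m = len(query), len(utterance)
    if n == 0 or m == 0:
        raise ValueError("query and utterance must be non-empty")
    inf = float("inf")
    # Row 0 is all zeros so the match may start at any utterance frame.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], utterance[j - 1])
            # Allowed steps: vertical, horizontal, diagonal.
            curr[j] = d + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The match may end at any utterance frame: take the cheapest one.
    best_end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_end] / n, best_end - 1


# Toy example: a 2-frame query that occurs at frames 1..2 of the utterance.
query = [[1.0, 0.0], [0.0, 1.0]]
utterance = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cost, end_frame = subsequence_dtw(query, utterance)
```

In a detection setting, a cost like this would be computed for every utterance in the database and the utterances ranked by it, with lower cost indicating a more likely occurrence of the query.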
