Multitask Feature Learning for Low-Resource Query-by-Example Spoken Term Detection

We propose a novel technique that learns a low-dimensional feature representation from unlabeled data of a target language and labeled data of a nontarget language. The technique is studied as a solution to query-by-example spoken term detection (QbE-STD) for a low-resource language. We extract low-dimensional features from the bottleneck layer of a multitask deep neural network, which is jointly trained with speech data from the low-resource target language and a resource-rich nontarget language. The proposed feature learning technique aims to extract acoustic features that offer phonetic discriminability. It explores a new way of leveraging cross-lingual speech data to overcome the resource limitation in the target language. We conduct QbE-STD experiments using the dynamic time warping distance between the multitask bottleneck features of the query and those of the search database. The QbE-STD process does not rely on an automatic speech recognition pipeline for the target language. We validate the effectiveness of multitask feature learning through a series of comparative experiments.
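
The matching step described above can be illustrated with a minimal sketch of a dynamic time warping distance between two sequences of feature frames. The abstract does not specify the local distance, the step pattern, or the normalization, so the choices here (cosine distance, the three standard steps, division by the summed sequence lengths) are assumptions, and the names `cosine_distance` and `dtw_distance` are hypothetical. A practical QbE-STD system would more likely search for the query inside longer utterances rather than align whole sequences end to end.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero frame is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(query, utterance):
    """Length-normalized DTW cost between two sequences of feature frames.

    Each sequence is a list of frames; each frame is a list of floats
    (for example, bottleneck features). Normalizing by len(query) +
    len(utterance) is an assumption made for this sketch.
    """
    n, m = len(query), len(utterance)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i query frames
    # with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], utterance[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # utterance frame j matched to several query frames
                cost[i][j - 1],      # query frame i matched to several utterance frames
                cost[i - 1][j - 1],  # both sequences advance by one frame
            )
    return cost[n][m] / (n + m)
```

Under these assumptions, a lower value means the query and the candidate are acoustically closer in the feature space, so candidates can be ranked by ascending distance.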
