Unsupervised Discovery of Structured Acoustic Tokens With Applications to Spoken Term Detection

In this paper, we compare two paradigms for unsupervised discovery of structured acoustic tokens directly from speech corpora without any human annotation. The multigranular paradigm seeks to capture all available information in the corpora with multiple sets of tokens for different model granularities. The hierarchical paradigm attempts to jointly learn several levels of signal representations in a hierarchical structure. We unify the two paradigms within a single theoretical framework. Query-by-example spoken term detection (QbE-STD) experiments on the dataset of the Query by Example Search on Speech Task of MediaEval 2015 verify the competitiveness of the acoustic tokens. The enhanced relevance score proposed in this work improves both paradigms on the QbE-STD task. We also report results on the ABX evaluation task of the Zero Resource Challenge 2015 to compare the two paradigms.
