Graph-based re-ranking using acoustic feature similarity between search results for spoken term detection on low-resource languages

Acoustic feature similarity between search results has been shown to be very helpful for spoken term detection (STD). A graph-based re-ranking approach for STD has been proposed, based on the idea that search results that are acoustically similar to other results with higher confidence scores should themselves receive higher scores. In this approach, the similarities between all search results for a given term are represented as a graph, and the confidence scores of the search results propagate through this graph. Since this approach can improve STD results without any additional labelled data, it is especially suitable for STD on languages with limited amounts of annotated data. However, its performance has not been widely studied on benchmark corpora. In this paper, we investigate the effectiveness of the graph-based re-ranking approach on limited language data from the IARPA Babel program. Experiments on the low-resource languages Assamese, Bengali and Lao show that graph-based re-ranking improves STD systems using fuzzy matching and lattices based on different kinds of units, including words, subwords, and hybrids.

Index Terms: Random Walk, Spoken Term Detection
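The score propagation described above can be illustrated with a small random-walk sketch. This is not the paper's exact formulation: the function name `rerank`, the interpolation weight `alpha`, the column normalisation of the similarity matrix, and the toy numbers are all assumptions made for illustration. It assumes the pairwise acoustic similarities between the hits of one query term are already available as a non-negative matrix, and that the initial confidence scores are positive.

```python
import numpy as np

def rerank(scores, similarity, alpha=0.5, iters=200, tol=1e-10):
    """Propagate confidence scores over a similarity graph (illustrative sketch).

    scores:     initial confidence scores of the hits for one term (positive).
    similarity: square matrix of non-negative acoustic similarities between hits.
    alpha:      hypothetical weight balancing propagated vs. original scores.
    """
    prior = np.asarray(scores, dtype=float)
    prior = prior / prior.sum()            # normalise initial scores to sum to 1

    W = np.array(similarity, dtype=float)  # copy so the caller's matrix is untouched
    np.fill_diagonal(W, 0.0)               # ignore self-similarity
    col_sums = W.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0        # isolated hits pass on nothing
    P = W / col_sums                       # each hit splits its score among neighbours

    g = prior.copy()
    for _ in range(iters):
        # New score = part of the original score + scores received from similar hits.
        g_next = (1.0 - alpha) * prior + alpha * (P @ g)
        converged = np.abs(g_next - g).sum() < tol
        g = g_next
        if converged:
            break
    return g

# Toy example: hit 1 is acoustically close to the confident hit 0, hit 2 is not.
initial = [0.9, 0.5, 0.5]
sim = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
new_scores = rerank(initial, sim)
```

In the toy example, hits 1 and 2 start with the same confidence, but hit 1 ends up ranked above hit 2 because it is similar to the high-confidence hit 0. No labelled data is involved, only the initial scores and the similarity graph.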
