Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search

This paper considers two problems for keyword search in low-resource settings: unsupervised selection of training data for the acoustic model, and vocabulary coverage of the keyword search system. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. Because the objective is submodular, the selection is near-optimal with respect to that objective. To further reduce the high out-of-vocabulary (OOV) rate of morphologically rich languages such as Tamil, we also consider word-morph mixed language modeling. Our experiments are conducted on the Tamil speech provided by the IARPA Babel program for the 2014 NIST Open Keyword Search Evaluation (OpenKWS14). We show that data selection strongly affects both the word error rate of the speech recognition system and the actual term weighted value (ATWV) of the keyword search system. The 10 hours of speech selected from the full language pack (FLP) using the proposed algorithm provide relative ATWV improvements of 23.2% and 20.7% over two other data subsets: the 10-hour limited language pack (LLP) defined by IARPA and 10 hours of speech randomly selected from the FLP, respectively. The proposed algorithm also increases vocabulary coverage, implicitly alleviating the OOV problem: the number of OOV search terms drops from 1,686 and 1,171 in the two baseline conditions to 972.
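
The selection idea can be illustrated with a minimal sketch; it is not the paper's implementation. It assumes each utterance has already been reduced to a sequence of per-frame Gaussian component indices (for example, the top-scoring component of a GMM), and it assumes a feature-based objective f(S) = sum over n-grams of sqrt(count of that n-gram in S), which is monotone submodular. The abstract does not specify the objective, so the function names (`ngram_counts`, `greedy_select`), the square-root form, and the gain-per-second scaling under a duration budget are all illustrative assumptions.

```python
import math
from collections import Counter


def ngram_counts(indices, n=2):
    """Count n-grams over a sequence of per-frame Gaussian component indices."""
    return Counter(tuple(indices[i:i + n]) for i in range(len(indices) - n + 1))


def greedy_select(utterances, durations, budget, n=2):
    """Greedily pick utterances under a total-duration budget.

    Assumed objective (hypothetical, not taken from the paper):
        f(S) = sum over n-grams g of sqrt(count of g in S)
    The concave sqrt gives diminishing returns for n-grams that are
    already well covered, so diverse utterances are preferred.
    """
    feats = {u: ngram_counts(seq, n) for u, seq in utterances.items()}
    totals = Counter()          # n-gram counts accumulated over the selected set
    selected, used = [], 0.0
    remaining = set(utterances)
    while remaining:
        best, best_gain = None, 0.0
        for u in sorted(remaining):             # sorted for deterministic ties
            if used + durations[u] > budget:
                continue                        # would exceed the budget
            gain = sum(
                math.sqrt(totals[g] + c) - math.sqrt(totals[g])
                for g, c in feats[u].items()
            )
            gain /= durations[u]                # marginal gain per second
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:                        # nothing fits or nothing helps
            break
        selected.append(best)
        used += durations[best]
        totals.update(feats[best])
        remaining.remove(best)
    return selected


# Toy data: three utterances of equal duration, budget for two of them.
toy = {"a": [1, 1, 1, 1], "b": [1, 2, 3, 4], "c": [1, 1, 1, 2]}
print(greedy_select(toy, {"a": 1.0, "b": 1.0, "c": 1.0}, budget=2.0))
# prints ['b', 'c']
```

In the toy run, utterance "b" is chosen first because it contributes three distinct bigrams, and "c" is preferred over "a" because it adds a new occurrence pattern instead of repeating a single bigram. Note that the classical near-optimality guarantee for greedy selection is stated for monotone submodular functions under a cardinality constraint; the cost-scaled variant used here is a common heuristic for duration budgets.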
