Data Balancing for Efficient Training of Hybrid ANN/HMM Automatic Speech Recognition Systems

Hybrid speech recognizers, in which artificial neural networks (ANNs) replace the Gaussian mixture models (GMMs) usually used to estimate the emission probability density functions of hidden Markov model (HMM) states, have several advantages over classical systems. However, obtaining these performance improvements comes at a heavily increased computational cost, because the ANN must be trained. Starting from the observation that speech data are remarkably skewed, this paper proposes sifting the training set and balancing the number of samples per class. With this method, training time is reduced by a factor of 18 while performance is similar to, or even better than, that obtained with the whole database, especially in noisy environments. However, applying these reduced sets is not straightforward. To avoid the mismatch between training and testing conditions created by modifying the distribution of the training data, the a posteriori probabilities obtained must be properly scaled and the context window resized, as demonstrated in this paper.
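
The two ideas in the abstract, balancing the number of training frames per class and rescaling the network's posteriors, can be illustrated with a minimal sketch. It is not the paper's procedure: the abstract does not specify the selection rule or the exact scaling, so the sketch assumes plain random undersampling to a per-class cap and the conventional hybrid ANN/HMM division of posteriors by the class priors of the data the network was trained on. The function names, the toy labels, and the cap of 20 frames are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_class(labels, max_per_class, seed=0):
    """Return sorted frame indices, keeping at most max_per_class per class.

    Assumption: random undersampling; the paper's selection rule may differ.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        if len(indices) > max_per_class:
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)


def class_priors(labels):
    """Relative frequency of each class in the given label sequence."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def scaled_likelihoods(posteriors, priors):
    """Divide each posterior P(q|x) by the prior P(q) of the training data.

    The result is proportional to the likelihood p(x|q) used by the HMM.
    """
    return {label: p / priors[label] for label, p in posteriors.items()}


# Toy, made-up frame labels in which silence dominates.
labels = ["sil"] * 900 + ["a"] * 80 + ["t"] * 20
kept = balance_by_class(labels, max_per_class=20)
balanced_labels = [labels[i] for i in kept]

# Priors must come from the set the network was actually trained on,
# here the balanced one, not from the original skewed database.
priors = class_priors(balanced_labels)
example_posteriors = {"sil": 0.5, "a": 0.3, "t": 0.2}  # hypothetical ANN output
scaled = scaled_likelihoods(example_posteriors, priors)
print(Counter(balanced_labels))
print(scaled)
```

In this toy case the balanced set has equal counts per class, so its priors are uniform and the division rescales every posterior by the same constant. With the original skewed labels the same division would instead strongly down-weight the dominant silence class, which is one way to see why the scaling has to match the training distribution.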
