Speech Phoneme Classification by Intelligent Decision-Level Fusion

This paper explores decision-level fusion for phoneme recognition through an intelligent combination of Naive Bayes and Learning Vector Quantization (LVQ) classifiers, together with feature fusion using Mel-frequency Cepstral Coefficients (MFCC), Relative Spectral Transform-Perceptual Linear Prediction (Rasta-PLP) and Perceptual Linear Prediction (PLP) coefficients. The work emphasizes optimal decision making from the decisions of classifiers trained on different features. The proposed architecture comprises three decision fusion approaches: weighted mean, deep belief networks (DBN) and fuzzy logic. We compare their performance experimentally on a dataset of phonemes from Fongbe, an African language. The best overall decision fusion performance is obtained with the fuzzy logic approach, with classification accuracies of 95.54% for consonants and 83.97% for vowels, although the deep belief network has the lower execution time.
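As an illustration of the simplest of the three fusion approaches, the sketch below fuses the class scores of two classifiers with a normalized weighted mean and picks the highest-scoring class. It is a minimal sketch, not the paper's implementation: the function names, the scores, the weights and the three vowel labels are all hypothetical, and the abstract does not say how the actual weights are chosen.

```python
def weighted_mean_fusion(scores_per_classifier, weights):
    """Combine per-classifier class scores with a normalized weighted mean.

    scores_per_classifier: one list of class scores per classifier,
        all of the same length (hypothetical inputs, e.g. posteriors).
    weights: one non-negative weight per classifier (assumed values).
    """
    if len(scores_per_classifier) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_classes = len(scores_per_classifier[0])
    fused = [0.0] * n_classes
    for scores, weight in zip(scores_per_classifier, weights):
        if len(scores) != n_classes:
            raise ValueError("all classifiers must score the same classes")
        for k, score in enumerate(scores):
            fused[k] += (weight / total) * score
    return fused


def decide(fused_scores, labels):
    """Return the label whose fused score is highest."""
    best = max(range(len(fused_scores)), key=fused_scores.__getitem__)
    return labels[best]


# Hypothetical example: two classifiers scoring three vowel classes.
labels = ["a", "e", "i"]
naive_bayes_scores = [0.6, 0.3, 0.1]   # assumed Naive Bayes output
lvq_scores = [0.2, 0.5, 0.3]           # assumed LVQ output
weights = [0.4, 0.6]                   # assumed classifier weights

fused = weighted_mean_fusion([naive_bayes_scores, lvq_scores], weights)
print(decide(fused, labels))
```

With these assumed values the two classifiers disagree on the top class, and the fused decision follows the more heavily weighted one. The DBN and fuzzy logic approaches replace this fixed linear rule with a learned or rule-based combination.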
