Adaptive decision-level fusion for Fongbe phoneme classification using fuzzy logic and Deep Belief Networks

In this paper, we compare three approaches to decision fusion in a phoneme classification problem. We focus on decision-level fusion of Naive Bayes and Learning Vector Quantization (LVQ) classifiers, each trained and tested on features from three speech analysis techniques: Mel-frequency Cepstral Coefficients (MFCC), Relative Spectral Transform - Perceptual Linear Prediction (Rasta-PLP) and Perceptual Linear Prediction (PLP). Optimal decision making is performed with a non-parametric method and a parametric method. We compare the performance of both decision methods with a third, proposed approach based on fuzzy logic. The work addresses phoneme classification for an African language, Fongbe, and all experiments were performed on a Fongbe dataset. After classification and decision fusion, the best overall fusion performance on test data is obtained with the proposed fuzzy logic approach, which reaches classification accuracies of 95.54% for consonants and 83.97% for vowels, although Deep Belief Networks have a lower execution time.
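
To make the idea of decision-level fusion concrete, the following is a minimal sketch of one simple combination rule: a fixed-weight sum of the per-class scores produced by two classifiers. It is an illustration only, not the fuzzy logic or Deep Belief Network method evaluated in this work. The function names, the weight `weight_a`, and the example scores are hypothetical and were chosen purely for illustration.

```python
from typing import Dict, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative class scores so that they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {label: value / total for label, value in scores.items()}


def fuse_decisions(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    weight_a: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Fuse two classifiers' class scores with a fixed-weight sum.

    weight_a is a hypothetical fixed weight for the first classifier;
    the second classifier receives 1 - weight_a.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if set(scores_a) != set(scores_b):
        raise ValueError("both classifiers must score the same classes")

    probs_a = normalize(scores_a)
    probs_b = normalize(scores_b)
    fused = {
        label: weight_a * probs_a[label] + (1.0 - weight_a) * probs_b[label]
        for label in probs_a
    }
    # The fused decision is the class with the highest combined score.
    best_label = max(fused, key=fused.get)
    return best_label, fused


# Made-up scores for three vowel classes (illustrative values only).
naive_bayes_scores = {"a": 0.6, "e": 0.3, "i": 0.1}
lvq_scores = {"a": 0.2, "e": 0.7, "i": 0.1}

label, fused_scores = fuse_decisions(naive_bayes_scores, lvq_scores)
print(label, fused_scores)
```

With equal weights the two classifiers disagree and the fused decision follows the larger combined score; raising `weight_a` shifts the decision toward the first classifier. An adaptive scheme would set such weights per input rather than fixing them in advance.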
