Deep Neural Network Based Recognition and Classification of Bengali Phonemes: A Case Study of Bengali Unconstrained Speech

This paper proposes a phoneme recognition and classification model for Bengali continuous speech. A Deep Neural Network based model is developed for both tasks, with a Stacked Denoising Autoencoder used to generatively pre-train the deep network; the autoencoders are stacked to form the deep-structured network. Mel-frequency cepstral coefficients are used as the input feature vector. The network has three hidden layers, and each hidden layer uses 200 hidden units. The output layer yields the phoneme posterior probabilities. The model is trained and tested on unconstrained Bengali continuous speech collected from several sources (TV, radio, and normal conversation in a laboratory). In the recognition task, the Phoneme Error Rate of the deep-structured model is 24.62% in training and 26.37% in testing; in the classification task, the model achieves an average phoneme classification accuracy of 86.7% in training and 82.53% in testing.
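The architecture described above (greedy layer-wise pre-training of denoising autoencoders, three hidden layers of 200 units, a posterior-producing output layer) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The 39-dimensional MFCC input, the 45 phoneme classes, the tied weights, the masking noise, the squared-error reconstruction loss, and all hyperparameters are assumptions chosen for illustration, and the supervised fine-tuning stage is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae(X, n_hidden, rng, noise=0.2, lr=0.01, epochs=5):
    """Train one tied-weight denoising autoencoder; return (W, b_hidden).

    Assumed setup: masking noise, linear decoder, squared-error loss.
    """
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_in)
    for _ in range(epochs):
        # Corrupt the input by zeroing a random fraction of its entries.
        X_noisy = X * (rng.random(X.shape) > noise)
        H = sigmoid(X_noisy @ W + b_h)
        X_rec = H @ W.T + b_v
        # The target is the clean input, not the corrupted one.
        err = X_rec - X
        dH = (err @ W) * H * (1.0 - H)
        # Tied weights: encoder and decoder gradients are summed.
        grad_W = X_noisy.T @ dH + err.T @ H
        W -= lr * grad_W / len(X)
        b_h -= lr * dH.mean(axis=0)
        b_v -= lr * err.mean(axis=0)
    return W, b_h

def build_stack(X, layer_sizes, rng):
    """Greedy layer-wise pre-training: each layer trains on the previous layer's codes."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae(H, n_hidden, rng)
        layers.append((W, b))
        H = sigmoid(H @ W + b)
    return layers

def phoneme_posteriors(X, layers, W_out, b_out):
    """Forward pass through the stack, then a softmax over phoneme classes."""
    H = X
    for W, b in layers:
        H = sigmoid(H @ W + b)
    logits = H @ W_out + b_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 39))           # stand-in for MFCC frames (assumed 39-dim)
layers = build_stack(frames, [200, 200, 200], rng)
W_out = rng.normal(0.0, 0.1, size=(200, 45))  # 45 phoneme classes is an assumption
b_out = np.zeros(45)
posteriors = phoneme_posteriors(frames, layers, W_out, b_out)
```

Each row of `posteriors` is a probability distribution over the assumed phoneme classes for one frame. In a full system, the output layer and the pre-trained stack would then be fine-tuned on labeled frames with backpropagation.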
