Bangla speech recognition using 1D CNN and LSTM with different dimension reduction techniques

This paper presents a model for Bangla speech recognition using machine learning algorithms. Mel-frequency Cepstral Coefficient (MFCC) and Mel Spectrogram features are extracted from a Bangla dataset. Three commonly used dimension reduction techniques, Principal Component Analysis (PCA), Kernel PCA (k-PCA), and t-distributed Stochastic Neighbor Embedding (t-SNE), are applied to the extracted features. Finally, a 1-dimensional Convolutional Neural Network (1D-CNN) and a Long Short-Term Memory (LSTM) network are used as classifiers. Experimental results show that, among the dimension reduction techniques, PCA yields higher accuracy than the other techniques, reaching 94.58% with 1D-CNN and 83.12% with LSTM. However, dimension reduction did not improve accuracy for either classifier: without any dimension reduction, MFCC with 1D-CNN achieved 97.26% accuracy and MFCC with LSTM achieved 93.83%.
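To illustrate the dimension reduction step described above, the following is a minimal sketch of PCA applied to a matrix of feature vectors. It is not the authors' implementation: the function name `pca_reduce`, the synthetic random data standing in for MFCC vectors, and the sizes (200 utterances, 40 coefficients, 10 components) are all assumptions chosen only for illustration.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-wise feature vectors onto their top principal components.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Center each feature dimension so PCA captures variance, not the mean.
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Fraction of total variance captured by each kept component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return reduced, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in for per-utterance feature vectors (assumed shape:
# 200 utterances x 40 coefficients). This is NOT the paper's dataset.
fake_features = rng.normal(size=(200, 40))

reduced, explained = pca_reduce(fake_features, n_components=10)
print(reduced.shape)
print(float(explained.sum()))
```

In a pipeline like the one described, the reduced matrix would then be passed to the classifier in place of the full feature vectors. The reported result that accuracy is highest without any reduction is consistent with the fact that the discarded components can still carry class-relevant information.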
