Paper information - Bangla speech recognition using 1D CNN and LSTM with different dimension reduction techniques

This paper presents a Bangla speech recognition model built on machine learning algorithms. Mel-frequency cepstral coefficients (MFCC) and Mel spectrogram features are extracted from a Bangla dataset. Three commonly used dimension reduction techniques, Principal Component Analysis (PCA), Kernel PCA (k-PCA), and t-distributed Stochastic Neighbor Embedding (t-SNE), are then applied to the extracted features. Finally, a 1-dimensional Convolutional Neural Network (1D-CNN) and a Long Short-Term Memory (LSTM) network are used as classifiers. Experimental results show that, among the dimension reduction techniques, PCA gives higher accuracy than the others: 94.58% with 1D-CNN and 83.12% with LSTM. However, dimension reduction has no positive impact on either 1D-CNN or LSTM. Without any dimension reduction, MFCC with 1D-CNN reaches 97.26% accuracy, compared with 93.83% for MFCC with LSTM.

Jia Uddin | Md. Nazmus Sabab | Mohammad Abidur Rahman Chowdhury | S. M. Mahsanul Islam Nirjhor
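The PCA step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pca_reduce`, the feature shape (200 utterances by 40 coefficients), and the choice of 10 components are all assumptions made for illustration, and the input is random data standing in for real MFCC vectors.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project the rows of `features` onto their top principal components.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # PCA operates on mean-centered data.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Fraction of total variance captured by each retained component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return reduced, explained[:n_components]


# Synthetic stand-in for per-utterance MFCC vectors (assumed shape, not real data).
rng = np.random.default_rng(0)
mfcc_like = rng.normal(size=(200, 40))

reduced, ratio = pca_reduce(mfcc_like, n_components=10)
print(reduced.shape)
```

The reduced matrix would then be fed to a classifier in place of the full feature vectors. The abstract reports that, on the paper's data, this kind of reduction did not improve accuracy over the unreduced MFCC features.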
