Challenges and Opportunities of Speech Recognition for Bengali Language

Speech recognition lets people interact with and command machines, making it a central technology in human-computer interaction. A speech recognition system is language-dependent: it is built directly on the linguistic and textual properties of the target language. Automatic Speech Recognition (ASR) systems are now widely used to convert speech to text. Although ASR systems are well established for widely spoken international languages, their implementation for the Bengali language has not yet reached an acceptable state. In this work, we survey the current state of research on Bengali ASR systems. We then present the challenges most commonly encountered when constructing a Bengali ASR system. We divide these into language-dependent and language-independent challenges and offer guidance on how each may be overcome. Having investigated and highlighted these challenges, we conclude that Bengali ASR requires architectures designed specifically around the grammatical and phonetic structure of the Bengali language.
