Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech

Automatic Speech Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human-Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from ASR research communities, owing to their natural ability to mimic biological behavior, which aids ASR modeling and processing. Current learning-based ASR techniques are evolving further with the incorporation of big data and IoT concepts. In this paper, we report certain machine learning (ML) based approaches for extracting relevant samples from a big data space and apply them to ASR of dialectal Assamese speech using certain soft computing techniques. A class of ML techniques is considered, comprising the basic Artificial Neural Network (ANN) in feedforward (FF) form and Deep Neural Networks (DNNs), operating on raw speech, extracted features and frequency-domain representations. The Multi-Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained through clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, relevant samples are selected and assimilated from a large storage. Next, a few conventional methods are used to extract features of selected types, comprising both spectral and prosodic forms. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background-noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and the computation time.
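The sample-selection step described above, in which class information derived from clustering is learned by an MLP, can be illustrated with a minimal sketch. This is an assumption-laden toy using scikit-learn (random vectors standing in for speech feature frames, k-means for the clustering stage, centroid distance as the "relevance" criterion); the paper's actual data, feature dimensionality and selection rule are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in for a large pool of speech feature vectors (e.g. 13-dim
# frame-level features); real data are placeholders in this sketch.
pool = rng.normal(size=(500, 13))

# Step 1: cluster the pool; cluster IDs serve as pseudo-class labels.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pool)
labels = kmeans.labels_

# Step 2: treat samples close to their cluster centroid as "relevant"
# and retain only those (a hypothetical selection rule for illustration).
dists = np.linalg.norm(pool - kmeans.cluster_centers_[labels], axis=1)
keep = dists < np.median(dists)
X, y = pool[keep], labels[keep]

# Step 3: train an MLP to learn the cluster-derived class information,
# so new samples can later be screened by the network instead.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(X, y)
print(mlp.score(X, y))
```

In a full system, the trained MLP would then filter incoming big data samples by predicted class before the downstream RNN/FFTDNN recognizers are applied.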
It is found that the proposed ML-based sentence extraction techniques, together with the composite feature set used with an RNN classifier, outperform all other approaches. The performance of the system with an ANN in FF form as the feature extractor is also evaluated and compared. Experimental results show that the use of big data samples enhances the learning of the ASR system. Further, the ANN-based sample and feature extraction techniques prove efficient enough to enable the application of ML techniques to big data aspects of ASR systems.
