Speech & Speaker Recognition for the Romanian Language

The present paper illustrates the main methods that can be employed to build a speech and speaker recognition system for the Romanian language. To this aim, we start by presenting the classical approach of extracting Mel-Frequency Cepstral Coefficient (MFCC) features from a dataset of speech signals representing words and phrases in Romanian. Recognition is then performed either with Dynamic Time Warping (DTW) or by training a Convolutional Neural Network (CNN). A comparison between these models is presented and discussed. Once such a system is developed, we proceed to implement an application that listens for and executes a set of predefined commands. In our setup, the system performs two main tasks: it identifies the user by voice and executes the task corresponding to the vocal command. The source code is available for download.
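A minimal sketch of the MFCC-plus-DTW recognition pipeline described above is given below. The library choice (librosa), the sample rate, the number of coefficients, and all file names are illustrative assumptions, not the authors' original implementation; the CNN-based variant is omitted for brevity.

```python
# Sketch: extract MFCC features from speech recordings and match a query
# against reference templates with dynamic time warping (DTW).
# All parameters and file names below are hypothetical.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=13, sr=16000):
    """Load a speech signal and return its MFCC matrix (frames x coefficients)."""
    signal, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (num_frames, n_mfcc)

def dtw_distance(a, b):
    """Classic DTW distance between two MFCC sequences of possibly different lengths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def recognize(query_path, templates):
    """Return the label of the reference template closest to the query under DTW."""
    query = extract_mfcc(query_path)
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))

# Hypothetical usage: each template maps a Romanian command word to a reference recording.
# templates = {"porneste": extract_mfcc("porneste.wav"),
#              "opreste": extract_mfcc("opreste.wav")}
# print(recognize("unknown.wav", templates))
```

The same MFCC matrices can be fed, after padding to a fixed number of frames, into a CNN classifier for the comparison mentioned above.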
