Speech & Speaker Recognition for the Romanian Language

The present paper illustrates the main methods that can be employed to build a speech and speaker recognition system for the Romanian language. To this aim, we start by presenting the classical approach of extracting Mel-Frequency Cepstral Coefficient (MFCC) features from a dataset of speech signals representing words and phrases in Romanian. Recognition is then performed either with Dynamic Time Warping (DTW) or by training a Convolutional Neural Network (CNN). A comparison between these models is presented and discussed. Once such a system is developed, we proceed to implement an application that listens for and executes a set of predefined commands. In our setup, the system performs two main tasks: it identifies the user by voice and executes the task corresponding to the vocal command. The source code is available for download.
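A minimal sketch of the MFCC-plus-DTW recognition pipeline described above is given below. The library choice (librosa), the sample rate, the number of coefficients, and all file names are illustrative assumptions, not the authors' original implementation; the CNN-based variant is omitted for brevity.

```python
# Sketch: extract MFCC features from speech recordings and match a query
# against reference templates with dynamic time warping (DTW).
# All parameters and file names below are hypothetical.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=13, sr=16000):
    """Load a speech signal and return its MFCC matrix (frames x coefficients)."""
    signal, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (num_frames, n_mfcc)

def dtw_distance(a, b):
    """Classic DTW distance between two MFCC sequences of possibly different lengths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def recognize(query_path, templates):
    """Return the label of the reference template closest to the query under DTW."""
    query = extract_mfcc(query_path)
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))

# Hypothetical usage: each template maps a Romanian command word to a reference recording.
# templates = {"porneste": extract_mfcc("porneste.wav"),
#              "opreste": extract_mfcc("opreste.wav")}
# print(recognize("unknown.wav", templates))
```

The same MFCC matrices can be fed, after padding to a fixed number of frames, into a CNN classifier for the comparison mentioned above.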
