Speaker-Independent Japanese Isolated Speech Word Recognition Using TDRC Features

Automatic speech recognition (ASR) may be defined as the process of recognizing a spoken string of words from an acoustic speech signal. This paper presents an approach to designing an ASR system for isolated word recognition that uses two-dimensional root cepstrum (TDRC) coefficients as features. The features are extracted from speech signals covering a moderate-sized vocabulary of 54 words, taken from the Tohoku University - Matsushita Japanese Isolated Word Database (TMW). The extracted features are used to train, validate and test a Bayesian-optimized k-nearest neighbor (k-NN) classifier. Bayesian optimization is used to select the machine learning hyperparameters that minimize the cross-validation loss. The search also covers two further hyperparameters, namely the neighborhood size and the distance function. Alongside the TDRC features, two-dimensional cepstrum (TDC) features are extracted from the speech signals of the same database. The results are presented, and the performance of the two feature sets is compared using statistical hypothesis tests. The tests show that TDRC features perform better than TDC features. Statistical hypothesis testing is also used to determine the optimum values of the root parameter $(\gamma)$ used in TDRC feature extraction for the TMW speech corpus.
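To make the difference between the two feature types concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common formulation in which the short-time magnitude spectra of a word are stacked into a time-frequency matrix, compressed either by a logarithm (TDC) or by raising to the power $\gamma$ (TDRC), and then passed through a two-dimensional inverse Fourier transform. The function names, the frame length, the hop size, the choice of keeping the real part, and the size of the retained low-order block are all illustrative assumptions.

```python
import numpy as np


def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping Hamming-windowed frames."""
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)


def two_d_cepstrum(x, gamma=None, frame_len=256, hop=128,
                   n_t=4, n_q=8, eps=1e-10):
    """Return a fixed-length feature vector for one isolated word.

    gamma=None  -> log compression (TDC-style features)
    gamma=float -> root compression |X|**gamma (TDRC-style features)

    n_t and n_q (hypothetical defaults) set how many low-order
    coefficients are kept along the time and quefrency axes.
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    # Time-frequency matrix of magnitudes: (n_frames, frame_len).
    mag = np.abs(np.fft.fft(frames, axis=1))
    if gamma is None:
        compressed = np.log(mag + eps)   # eps avoids log(0)
    else:
        compressed = mag ** gamma
    # 2-D inverse transform over (time, frequency); keeping the real
    # part is an assumption made for this sketch.
    cep = np.real(np.fft.ifft2(compressed))
    # The low-order block gives the same feature size for any word length.
    return cep[:n_t, :n_q].ravel()


# Example on a synthetic signal standing in for a recorded word.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4000)
tdc_features = two_d_cepstrum(signal)               # log compression
tdrc_features = two_d_cepstrum(signal, gamma=0.1)   # root compression
```

Because only a fixed low-order block is retained, words of different durations map to feature vectors of equal length, which is what a distance-based classifier such as k-NN requires.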
