The Speech Recognition of Double-Syllable Chinese Words Based on the Hilbert Spectrum

This paper studies a small-vocabulary Chinese lexical recognition task covering 40 double-syllable Chinese words. The proposed approach analyzes the speech signal with the Hilbert-Huang Transform (HHT), which consists of two steps. First, the speech signal is decomposed into a set of intrinsic mode functions (IMFs) using empirical mode decomposition (EMD). Second, the first two IMFs are retained for Hilbert spectral analysis. The resulting representation of the speech signal is an energy-frequency-time distribution called the Hilbert spectrum, which depicts the characteristics of speech sounds. For feature extraction, the Hilbert spectrum of each speech signal is divided into a set of frequency sub-bands, and the number of discrete points of the Hilbert spectrum that fall in each sub-band is taken as one element of the feature vector. The feature vectors are then fed to a Support Vector Machine (SVM) classifier. The method is evaluated on 3840 speech samples from 8 speakers (4 male). The overall recognition rate for the 40 words is around 97%, which demonstrates the effectiveness of the approach.
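
The sub-band counting step can be illustrated with a minimal sketch. It assumes the instantaneous frequencies of the retained IMFs have already been computed. The function name `subband_counts`, the band edges, and the sample values are hypothetical and chosen only for illustration; the abstract does not specify the number or width of the sub-bands.

```python
from bisect import bisect_right


def subband_counts(imf_inst_freqs_hz, band_edges_hz):
    """Count Hilbert-spectrum points that fall in each frequency sub-band.

    imf_inst_freqs_hz: one list of instantaneous frequencies (Hz) per
        retained IMF, one value per time sample.
    band_edges_hz: ascending band edges; band i covers
        [band_edges_hz[i], band_edges_hz[i + 1]).
    """
    counts = [0] * (len(band_edges_hz) - 1)
    for imf in imf_inst_freqs_hz:
        for f in imf:
            # Ignore points outside the analysed frequency range.
            if band_edges_hz[0] <= f < band_edges_hz[-1]:
                counts[bisect_right(band_edges_hz, f) - 1] += 1
    return counts


# Hypothetical instantaneous frequencies for the first two IMFs.
imf1 = [2500.0, 3100.0, 1200.0, 900.0]
imf2 = [400.0, 650.0, 1100.0, 300.0]
edges = [0.0, 1000.0, 2000.0, 4000.0]  # assumed band edges, not from the paper

features = subband_counts([imf1, imf2], edges)
print(features)  # [4, 2, 2]
```

Each count becomes one element of the feature vector passed to the classifier. Whether the counts are normalized before classification is not stated in the abstract, so the sketch leaves them raw.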
