Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features Using Extended Recognition Networks

Tone is a distinctive feature of Mandarin Chinese, and tone recognition helps disambiguate words in Mandarin speech recognition. Most previous studies have relied on prosodic features (e.g., F0, duration, and energy) to improve tone recognition. In this paper, we propose a novel framework that integrates articulatory features (AFs) and MFCC into a DNN-HMM based tone recognition system. The procedure consists of three steps: 1) estimating the posterior probabilities of different AFs with a DNN classifier; 2) combining the estimated posterior probabilities with MFCC and F0 to form the input features; 3) performing tone recognition with the DNN-HMM. The baseline system, also based on DNN-HMM, used F0 and energy as input features. Three contrast experiments were conducted with different feature sets under the DNN framework: MFCC only, MFCC+F0, and MFCC+F0+AFs. The system with MFCC+F0 features outperformed the baseline, with a 46.8% relative error reduction. Incorporating the AFs yielded a further relative reduction of about 10.1% in the tone error rate, demonstrating the effectiveness of the proposed method.
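
Step 2 of the procedure, combining AF posteriors with MFCC and F0, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature dimensions, the AF groups ("place", "manner") and their class counts, and the single linear layer standing in for the AF DNN classifier are all assumptions made for illustration, and the input frames are random placeholders rather than real speech.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def af_posteriors(mfcc, weights):
    """Hypothetical stand-in for the AF classifiers.

    Each AF group gets one linear layer followed by a softmax, so every
    frame receives a posterior distribution over that group's classes.
    A real system would use trained DNNs here.
    """
    return [softmax(mfcc @ w) for w in weights]


def combine_features(mfcc, f0, posteriors):
    """Concatenate MFCC, F0, and AF posteriors frame by frame."""
    return np.hstack([mfcc, f0.reshape(-1, 1)] + posteriors)


rng = np.random.default_rng(0)
n_frames, n_mfcc = 5, 13                      # assumed sizes
mfcc = rng.normal(size=(n_frames, n_mfcc))    # placeholder MFCC frames
f0 = rng.uniform(80.0, 300.0, size=n_frames)  # placeholder F0 values in Hz

# Assumed AF groups and class counts, chosen only for the example.
af_class_counts = {"place": 4, "manner": 3}
weights = [rng.normal(size=(n_mfcc, k)) for k in af_class_counts.values()]

posteriors = af_posteriors(mfcc, weights)
features = combine_features(mfcc, f0, posteriors)
print(features.shape)  # (5, 21): 13 MFCC + 1 F0 + 4 + 3 posteriors
```

The resulting matrix has one row per frame and would serve as the input to the DNN-HMM tone recognizer in step 3.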
