A Robust Pitch Extractor Based on DTW Lines and CASA with Application in Noisy Speech Recognition

This paper proposes a robust pitch extractor for Automatic Speech Recognition based on selecting pitch lines from a tonegram (a representation of the energy of each pitch candidate at each time frame). First, the tonegram and its maximum-energy regions are extracted, and a Dynamic Time Warping algorithm finds the most energetic trajectories, or pitch lines, through these regions. A second stage estimates the tonegram of the most energetic lines by applying Computational Auditory Scene Analysis rules that reject and group octave-related lines. The mean pitch of the speaker is then estimated, and the final pitch is obtained by rejecting lines that lie far from this mean. The proposed pitch extractor is evaluated in a novel way: by the word accuracy of a Missing Data recognizer on the Aurora-2 database.
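The idea of tracing an energetic pitch line through a tonegram can be illustrated with a minimal sketch. It assumes the tonegram is a list of frames, each holding one energy value per pitch bin, and it uses a plain dynamic-programming path search with a limit on how far the pitch may move between frames. This is a simplified stand-in for illustration, not the paper's DTW formulation; the function name `best_pitch_line` and the parameter `max_jump` are hypothetical.

```python
def best_pitch_line(tonegram, max_jump=1):
    """Find one pitch-bin index per frame that maximizes the summed energy.

    tonegram: list of frames, each a list of energies per pitch bin (assumed layout).
    max_jump: largest allowed bin change between consecutive frames (assumed constraint).
    Returns (path, total_energy).
    """
    n_bins = len(tonegram[0])
    # score[t][b] = best accumulated energy of any path ending at bin b in frame t
    score = [list(tonegram[0])]
    back = []  # back[t-1][b] = predecessor bin chosen for bin b in frame t

    for t in range(1, len(tonegram)):
        row_score, row_back = [], []
        for b in range(n_bins):
            lo = max(0, b - max_jump)
            hi = min(n_bins - 1, b + max_jump)
            # Best predecessor within the allowed jump range
            prev = max(range(lo, hi + 1), key=lambda p: score[t - 1][p])
            row_score.append(score[t - 1][prev] + tonegram[t][b])
            row_back.append(prev)
        score.append(row_score)
        back.append(row_back)

    # Backtrack from the most energetic final bin
    end = max(range(n_bins), key=lambda b: score[-1][b])
    path = [end]
    for row_back in reversed(back):
        path.append(row_back[path[-1]])
    path.reverse()
    return path, score[-1][end]


if __name__ == "__main__":
    toy = [
        [1, 9, 1],
        [1, 1, 8],
        [1, 1, 7],
    ]
    print(best_pitch_line(toy, max_jump=1))
```

A fuller extractor would repeat this search to collect several lines and then apply the octave-grouping and mean-pitch rejection steps the abstract describes; those steps are not shown here.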
