F0 estimation using empirical mode decomposition and complex cepstrum analysis in reverberant environments

Fundamental frequency (F0) is an important cue in speech signal processing, but the performance of current state-of-the-art F0 estimation methods degrades drastically under reverberation. We therefore propose an F0 estimation method that is robust in reverberant environments, based on complex cepstrum analysis and empirical mode decomposition (EMD). The dereverberation stage of the method has two parts: one deals with the amplitude cepstrum and the other with the phase cepstrum. In the first part, EMD is used to decompose the averaged amplitude cepstrum of reverberant speech into two groups of intrinsic mode functions (IMFs). The first group is associated with the amplitude cepstrum of the clean speech signals. The second group, which is associated with the room impulse response (RIR), is used to enhance the reverberant speech signals. In the second part, the all-pass phase cepstrum of the target reverberant speech is modified by a value related to the reverberation time. F0 is then estimated from the enhanced speech signals. The results showed that the proposed method estimated F0 more accurately than other methods such as SWIPE and YIN.
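The abstract does not spell out why the amplitude cepstrum is a convenient domain for separating speech from the RIR. A likely reason is the standard cepstral property that convolution in time becomes addition in the cepstrum, so the cepstrum of reverberant speech is the sum of a speech term and an RIR term. The following is a minimal sketch of that property, together with plain cepstral peak picking for F0. It is not the proposed method: the EMD grouping and the all-pass phase modification are not reproduced, the signals are synthetic, and the names `amplitude_cepstrum` and `cepstral_f0`, the toy decaying-noise "RIR", and the 80 to 400 Hz search range are assumptions made for illustration.

```python
import numpy as np

def amplitude_cepstrum(x, n_fft):
    """Amplitude (real) cepstrum: inverse FFT of the log-magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(x, n_fft)) + 1e-12)  # small floor avoids log(0)
    return np.fft.ifft(log_mag).real

def cepstral_f0(frame, fs, f_min=80.0, f_max=400.0):
    """Pick the strongest cepstral peak among plausible pitch periods (assumed range)."""
    windowed = frame * np.hamming(len(frame))
    ceps = amplitude_cepstrum(windowed, len(frame))
    q_lo, q_hi = int(fs / f_max), int(fs / f_min)   # quefrency bounds in samples
    peak = q_lo + np.argmax(ceps[q_lo:q_hi + 1])
    return fs / peak

rng = np.random.default_rng(0)
fs = 16000

# 1) Convolution with an impulse response becomes addition in the cepstral domain.
x = rng.standard_normal(1024)                                     # stand-in for a dry signal
h = rng.standard_normal(512) * np.exp(-np.arange(512) / 100.0)    # toy decaying "RIR"
y = np.convolve(x, h)                                             # reverberant signal
n_fft = 2048                                                      # >= len(y): no circular aliasing
c_x = amplitude_cepstrum(x, n_fft)
c_h = amplitude_cepstrum(h, n_fft)
c_y = amplitude_cepstrum(y, n_fft)
additive = np.allclose(c_y, c_x + c_h, atol=1e-6)

# 2) Cepstral peak picking on a synthetic 120 Hz harmonic tone (60 harmonics, 1/k rolloff).
t = np.arange(1024) / fs
tone = sum(np.sin(2 * np.pi * 120.0 * k * t) / k for k in range(1, 61))
f0_est = cepstral_f0(tone, fs)

print("cepstra add under convolution:", additive)
print("estimated F0 (Hz):", round(float(f0_est), 1))
```

Because the RIR contributes an additive term, any procedure that can attribute parts of the averaged amplitude cepstrum to the room rather than to the speech (EMD, in the abstract) can in principle subtract the room's contribution before F0 is estimated.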
