On the Usefulness of the Speech Phase Spectrum for Pitch Extraction

© 2018 International Speech Communication Association. All rights reserved.

Most frequency-domain techniques for pitch extraction, such as the cepstrum, the harmonic product spectrum (HPS) and the summation of residual harmonics (SRH), operate on the magnitude spectrum and turn it into a function whose argmax is the fundamental frequency. In this paper, we investigate the extension of these three techniques to the phase and group delay (GD) domains. Our extensions exploit the observation that, for some monotonically increasing function F, the bin at which F(magnitude) reaches its maximum is the same bin at which F(phase) has its maximum negative slope and F(group delay) reaches its maximum. To extract the pitch track from the speech phase spectrum, these techniques were coupled with the source-filter model in the phase domain that we proposed in earlier publications and with a novel voicing detection algorithm proposed here. The accuracy and robustness of the phase-based pitch extraction techniques are illustrated and compared with those of their magnitude-based counterparts using six pitch evaluation metrics. On average, the phase spectrum can be employed in pitch tracking with accuracy and robustness comparable to those of the speech magnitude spectrum.
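As a point of reference for the magnitude-domain baseline mentioned in the abstract, the following is a minimal sketch of a textbook harmonic product spectrum estimator, in which the fundamental frequency is read off as an argmax. It is not the paper's phase-domain or group-delay extension, and it omits voicing detection. The function names (`hps_pitch`, `synthetic_voiced_frame`), the FFT size, the number of compressed spectra, the search range and the synthetic test signal are all assumptions chosen for illustration.

```python
import numpy as np

def hps_pitch(frame, fs, n_fft=4096, n_harmonics=4, fmin=60.0, fmax=400.0):
    """Estimate F0 (Hz) of one frame as the argmax of the harmonic product spectrum.

    Hypothetical helper: all parameter defaults are illustrative assumptions.
    """
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed, n=n_fft))

    # Keep only bins k for which the most compressed copy (index n_harmonics * k) exists.
    n_bins = len(magnitude) // n_harmonics

    # Sum of log-magnitudes (log is monotonically increasing), equivalent to
    # the product of the frequency-compressed magnitude spectra.
    log_hps = np.zeros(n_bins)
    for r in range(1, n_harmonics + 1):
        log_hps += np.log(magnitude[::r][:n_bins] + 1e-12)

    # Restrict the argmax to a plausible pitch range.
    freqs = np.arange(n_bins) * fs / n_fft
    in_range = (freqs >= fmin) & (freqs <= fmax)
    best_bin = np.flatnonzero(in_range)[np.argmax(log_hps[in_range])]
    return freqs[best_bin]

def synthetic_voiced_frame(f0, fs, duration=0.04, n_harmonics=10):
    """Hypothetical test signal: a sum of harmonics of f0 with 1/k amplitudes."""
    t = np.arange(int(round(duration * fs))) / fs
    return sum(np.cos(2 * np.pi * k * f0 * t) / k for k in range(1, n_harmonics + 1))

if __name__ == "__main__":
    fs = 16000
    frame = synthetic_voiced_frame(200.0, fs)
    print(f"Estimated F0: {hps_pitch(frame, fs):.1f} Hz")
```

The estimate is quantised to the FFT bin spacing (`fs / n_fft`), so zero-padding to a larger `n_fft` gives a finer frequency grid without changing the underlying resolution set by the frame length.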
