Statistical normalisation of phase-based feature representation for robust speech recognition

In earlier work we proposed a source-filter decomposition of speech through phase-based processing. The decomposition leads to novel speech features that are extracted from the filter component of the phase spectrum. This paper analyses this spectrum and the proposed representation by evaluating statistical properties at various points along the parametrisation pipeline. We show that the speech phase spectrum has a bell-shaped distribution, in contrast to the uniform distribution that is usually assumed. We demonstrate that the uniform density (which implies that the corresponding sequence is least informative) is an artefact of phase wrapping and not an inherent characteristic of this spectrum. In addition, we extend the idea of statistical normalisation, usually applied to magnitude-based features, into the phase domain. Based on the statistical structure of the phase-based features, which is shown to be super-Gaussian in the clean condition, three normalisation schemes, namely Gaussianisation, Laplacianisation and table-based histogram equalisation, are applied to improve robustness. Speech recognition experiments using Aurora-2 show that applying an optimal normalisation scheme at the right stage of the feature extraction process can produce average relative WER reductions of up to 18.6% across the 0–20 dB SNR conditions.
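As a rough illustration of what Gaussianisation does to a single feature dimension, the following minimal sketch maps each value through its empirical rank to the matching standard-normal quantile. It is not the paper's implementation: the function name `gaussianise`, the mid-rank CDF estimate, and the synthetic Laplacian test data are all assumptions made for illustration, and the stage of the pipeline at which the mapping is applied is left open.

```python
import random
from statistics import NormalDist, fmean, pstdev


def gaussianise(values):
    """Rank-based Gaussianisation of a 1-D sequence (hypothetical helper).

    Each value is replaced by the standard-normal quantile of its
    empirical CDF position, so the output is approximately N(0, 1)
    while the ordering of the input values is preserved.
    """
    n = len(values)
    std_normal = NormalDist(0.0, 1.0)
    # Indices sorted by value; the position in this list is the rank.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # Mid-rank empirical CDF estimate, strictly inside (0, 1).
        p = (rank + 0.5) / n
        out[idx] = std_normal.inv_cdf(p)
    return out


# Synthetic super-Gaussian (Laplacian) data standing in for one
# phase-based feature trajectory; this is an assumption, not real data.
rng = random.Random(0)
feature = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(2000)]
warped = gaussianise(feature)

print(f"mean after warping: {fmean(warped):.4f}")
print(f"std after warping:  {pstdev(warped):.4f}")
```

Laplacianisation would follow the same pattern with the inverse CDF of a Laplace distribution as the target, and table-based histogram equalisation replaces the analytic target with a reference CDF stored as a lookup table.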
