Pitch prediction from Mel-frequency cepstral coefficients using sparse spectrum recovery

This work proposes a technique for predicting pitch from Mel-frequency cepstral coefficient (MFCC) vectors. Previous pitch prediction methods are based on statistical models such as Gaussian mixture models and hidden Markov models. In this paper, we propose a three-step method to estimate pitch from MFCC vectors. First, the Mel-filterbank energies (MFBEs) are estimated from the MFCC vectors. Second, the spectrum is estimated from the MFBEs using a novel method that exploits the sparse nature of the voiced speech spectrum. Finally, the pitch is estimated from the recovered spectrum. We also explore how different levels of truncation of the discrete cosine transform (DCT) coefficients in MFCC computation affect the pitch prediction error. We use a deep neural network (DNN) based predictor as the baseline for predicting pitch from MFCC vectors. Experiments on the CMU-ARCTIC and KEELE databases show that the proposed three-step method generalizes better across databases and genders, yielding a drop of ∼8 Hz and ∼5 Hz in the average RMSE of the predicted pitch relative to the DNN when 13-dimensional and 26-dimensional MFCC vectors, respectively, are used for pitch prediction. We also find that the sparsity constraint performs better in recovering the spectrum at lower pitch values.
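
As a rough illustration of the first step and of the DCT truncation effect mentioned above, the following sketch recovers log Mel-filterbank energies from a truncated MFCC vector by zero-padding the missing coefficients and inverting the DCT. It is a minimal sketch under stated assumptions, not the paper's implementation: the orthonormal DCT-II convention, the 26-filter bank, the toy energy vector, and all function names (`dct_ii`, `idct_ii`, `truncation_error`) are hypothetical choices made here, and a real MFCC front end may add liftering or different scaling.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a sequence (assumed MFCC convention)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[m] * math.cos(math.pi * k * (m + 0.5) / n) for m in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_ii(coeffs, n):
    """Invert the orthonormal DCT-II, treating missing coefficients as zero."""
    padded = list(coeffs) + [0.0] * (n - len(coeffs))
    out = []
    for m in range(n):
        s = padded[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            s += padded[k] * math.sqrt(2.0 / n) * math.cos(math.pi * k * (m + 0.5) / n)
        out.append(s)
    return out


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def truncation_error(log_mfbe, num_ceps):
    """Error in the recovered log MFBEs when only num_ceps MFCCs are kept."""
    mfcc = dct_ii(log_mfbe)[:num_ceps]          # forward: log MFBE -> MFCC
    recovered = idct_ii(mfcc, len(log_mfbe))    # inverse: MFCC -> log MFBE
    return rmse(log_mfbe, recovered)


# Toy log filterbank energies: a slow trend plus a fast ripple (illustrative only).
NUM_FILTERS = 26
log_mfbe = [
    math.log(1.0 + 0.1 * m + 0.5 * math.cos(2.0 * math.pi * m / 3.0))
    for m in range(NUM_FILTERS)
]

err_13 = truncation_error(log_mfbe, 13)
err_26 = truncation_error(log_mfbe, 26)
print("RMSE with 13 coefficients:", err_13)
print("RMSE with 26 coefficients:", err_26)
```

With all 26 coefficients the inversion is exact up to rounding, whereas keeping 13 discards the fast-varying part of the filterbank energies. That loss of detail is one plausible reason the prediction error depends on the truncation level; the spectrum recovery and pitch estimation steps are not sketched here.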
