Pitch prediction from MFCC vectors for speech reconstruction

The paper proposes a technique for reconstructing an acoustic speech signal solely from a stream of Mel-frequency cepstral coefficients (MFCCs). Previous speech reconstruction methods have required an additional pitch element, but this work proposes two maximum a posteriori (MAP) methods for predicting pitch from the MFCC vectors themselves. The first method is based on a Gaussian mixture model (GMM), while the second utilises the temporal correlation available from a hidden Markov model (HMM) framework. Formal measurements of frame classification accuracy and RMS pitch error show that an HMM-based scheme with 5 clusters per state correctly classifies over 94% of frames and has an RMS pitch error of 3.1 Hz relative to a reference pitch. Informal listening tests and analysis of spectrograms reveal that speech reconstructed solely from the MFCC vectors is almost indistinguishable from speech reconstructed using the reference pitch.
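As an illustration of the GMM-based idea, the following is a minimal sketch of one common way to predict pitch from a feature vector with a joint GMM. It is not taken from the paper: it assumes a GMM fitted over the joint vector [MFCC, pitch], selects the component with the highest posterior given the MFCC vector alone, and returns that component's conditional mean pitch. The function names, the toy model, and all parameter values are hypothetical.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log density of a multivariate normal evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def predict_pitch_map(x, weights, means, covs):
    """Predict pitch from one MFCC vector x (assumed joint-GMM formulation).

    Each component k models the joint vector z = [x, f], where f is pitch.
    means[k] has length d + 1 and covs[k] has shape (d + 1, d + 1);
    the last dimension is the pitch.
    """
    d = x.shape[0]
    # Posterior of each component given x, using the MFCC marginal only.
    log_post = np.array([
        np.log(w) + log_gaussian(x, m[:d], c[:d, :d])
        for w, m, c in zip(weights, means, covs)
    ])
    k = int(np.argmax(log_post))  # most probable component
    m, c = means[k], covs[k]
    # Conditional mean of pitch given x within the selected component.
    return m[d] + c[d, :d] @ np.linalg.solve(c[:d, :d], x - m[:d])


# Toy model (hypothetical values): 2-dimensional "MFCC" vectors, 2 components.
toy_cov = np.array([[1.0, 0.0, 0.5],
                    [0.0, 1.0, 0.0],
                    [0.5, 0.0, 4.0]])
toy_weights = np.array([0.5, 0.5])
toy_means = [np.array([0.0, 0.0, 100.0]), np.array([5.0, 5.0, 200.0])]
toy_covs = [toy_cov, toy_cov]

pitch_low = predict_pitch_map(np.array([1.0, 0.0]), toy_weights, toy_means, toy_covs)
pitch_high = predict_pitch_map(np.array([5.0, 5.0]), toy_weights, toy_means, toy_covs)
```

In the toy model, a vector near the first component's mean yields a pitch close to 100 Hz and one near the second yields a pitch close to 200 Hz. An HMM-based variant would additionally constrain which components are plausible at each frame using the state sequence; that step is not sketched here.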
