A spectral envelope estimation method based on F0-adaptive multi-frame integration analysis

This paper presents a novel method of spectral envelope estimation and representation. Despite much sophisticated work in this area, estimating an appropriate envelope is still difficult. We therefore propose an F0-adaptive multi-frame integration analysis method for estimating spectral envelopes with appropriate shape and high temporal resolution. The method does not use pitch marks or phoneme labels and can be used with various types of sound (speech, singing, and instruments). The basic idea is to use F0-adaptive window analysis with a small window length yielding high temporal resolution. The analysis is then extended by using neighboring frames to obtain a stable spectral envelope. In tests using synthesized sound and resynthesized natural sound samples, for 8 of 14 samples the log-spectral distances obtained with the proposed method were smaller than those obtained with well-known previous methods.
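The ingredients named above (an F0-adaptive short window, integration over neighboring frames, and a log-spectral distance for evaluation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Hann window shape, the window length of two pitch periods, the log-domain averaging used as the integration rule, and the function names `f0_adaptive_window`, `integrate_frames`, and `log_spectral_distance`.

```python
import math


def f0_adaptive_window(f0_hz, sample_rate, periods=2.0):
    """Return a Hann window whose length scales with the pitch period.

    Assumption: the window spans `periods` pitch periods (2 by default).
    """
    length = max(3, int(round(periods * sample_rate / f0_hz)))
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def integrate_frames(frames, center, radius=1):
    """Combine the amplitude spectra of neighboring frames into one envelope.

    Assumption: the integration rule is a plain average of log amplitudes
    over frames center-radius .. center+radius (clipped at the edges).
    `frames` is a list of equally long lists of positive amplitudes.
    """
    lo = max(0, center - radius)
    hi = min(len(frames), center + radius + 1)
    neighbors = frames[lo:hi]
    n_bins = len(frames[center])
    return [
        10.0 ** (sum(math.log10(f[k]) for f in neighbors) / len(neighbors))
        for k in range(n_bins)
    ]


def log_spectral_distance(env_a, env_b):
    """Root-mean-square difference, in dB, between two amplitude envelopes
    sampled on the same frequency grid."""
    diffs = [20.0 * math.log10(a / b) for a, b in zip(env_a, env_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    # A higher F0 gives a shorter window, hence finer temporal resolution.
    print(len(f0_adaptive_window(100.0, 16000)))
    print(len(f0_adaptive_window(200.0, 16000)))
```

In this sketch, the short window trades frequency stability for temporal resolution, and the neighboring-frame step is what restores a stable envelope; a smaller log-spectral distance to a reference envelope indicates a closer estimate.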
