Cross-spectral methods for processing speech.

We present time-frequency methods that are well suited to the analysis of nonstationary multicomponent FM signals, such as speech. These methods are based on group delay, instantaneous frequency, and higher-order phase derivative surfaces computed from the short-time Fourier transform (STFT). Unlike more conventional approaches, they do not assume that the signal is locally stationary. We describe the computation of the phase derivatives, their physical interpretation, and a re-mapping algorithm based on them. We show analytically, and by example, that the re-mapping converges to the FM representation of the signal. The methods are applied to speech to estimate signal parameters, such as the group delay of a transmission channel and speech formant frequencies. Our goal is to develop a unified method that accurately estimates speech components in both time and frequency, and to apply it to the estimation of instantaneous formant frequencies, effective excitation time, vocal tract group delay, and channel group delay. The proposed method has several interesting properties, the most important of which is the ability to resolve all FM components of a multicomponent signal simultaneously, as long as the STFT of the composite signal satisfies a simple separability condition. The method can provide super-resolution in both time and frequency, in the sense that it yields simultaneous time and frequency estimates of the FM components that are much finer than the Heisenberg uncertainty of the STFT. Super-resolution makes it possible to accurately "re-map" each component of the STFT surface to the time and frequency of the FM signal component it represents. To attain high resolution and accuracy, the signal must be estimated jointly in time and frequency. This is accomplished by estimating two surfaces, which are essentially the derivatives of the STFT phase with respect to time and frequency. To avoid phase ambiguities, the differentiation is performed as a cross-spectral product.
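The idea of taking phase derivatives as cross-spectral products can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Hann window, the window length and hop, the one-sample delay used for the time derivative, and the adjacent-bin product used for the frequency derivative are all choices made here for the example. The phase of each product is a phase difference, so no explicit phase unwrapping is needed as long as that difference stays within (-pi, pi].

```python
import numpy as np

def _stft(x, win, starts):
    """STFT of the frames beginning at `starts` (hypothetical helper)."""
    n = len(win)
    frames = np.stack([x[s:s + n] * win for s in starts])
    # Rotate each frame so that phase is referenced to the frame centre.
    return np.fft.rfft(np.fft.ifftshift(frames, axes=1), axis=1)

def cross_spectral_surfaces(x, fs, n_win=256, hop=64):
    """Return (inst_freq_hz, remap_time_s, magnitude) per STFT cell.

    Assumed sketch: both phase derivatives are taken as the argument
    of a cross-spectral product instead of differencing unwrapped phase.
    """
    win = np.hanning(n_win)
    starts = np.arange(0, len(x) - n_win, hop)
    X0 = _stft(x, win, starts)
    X1 = _stft(x, win, starts + 1)  # same frames, one sample later

    # Time derivative of phase: product of the STFT with the conjugate
    # of its one-sample-earlier copy, scaled to Hz.
    inst_freq = np.angle(X1 * np.conj(X0)) * fs / (2 * np.pi)

    # Frequency derivative of phase: product of adjacent frequency bins,
    # scaled to a delay in seconds relative to the frame centre.
    delay = np.angle(X0[:, :-1] * np.conj(X0[:, 1:])) * n_win / (2 * np.pi) / fs

    centres = (starts + n_win / 2) / fs
    remap_time = centres[:, None] + delay
    # Drop the last bin so all three surfaces share one shape.
    return inst_freq[:, :-1], remap_time, np.abs(X0[:, :-1])

# Usage: a steady tone that falls between STFT bins (bin spacing 31.25 Hz).
fs = 8000
t = np.arange(4096) / fs
tone = np.cos(2 * np.pi * 1000.3 * t)
f_hz, t_s, mag = cross_spectral_surfaces(tone, fs)
k = int(np.argmax(mag[5]))          # strongest bin in frame 5
print(f"estimated frequency: {f_hz[5, k]:.2f} Hz")
```

Each STFT cell can then be moved to the coordinates `(remap_time, inst_freq)`, which is one simple reading of the re-mapping step. The separability condition matters here: the estimates for a cell are only meaningful when a single component dominates it.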
