Robust cepstral based pitch determination

Visual Information Technologies, Plano, Texas 75075, (214) 985-2267

Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas, Richardson, Texas 75083-0688, (214) 690-2894, degroat@utdallas.edu

Joseph Picone, Speech and Image Understanding Laboratory, Texas Instruments Inc., P.O. Box 655474, MS 238, Dallas, Texas 75265, (214) 995-6627

The FFT based cepstral method of human speech pitch (or fundamental frequency) determination is known to be accurate and reliable in studio quality environments; however, it leaves much to be desired at lower signal to noise ratios. Cepstral pitch determination techniques, which are a special case of the more general theory of homomorphic signal processing, rely on the log operation to deconvolve the pitch sequence from the vocal tract response sequence. Classical cepstral processing models do not account for noise added to the signal. In this paper, we develop a noisy cepstral signal model for speech processing and we propose two Singular Value Decomposition (SVD) based approaches which greatly enhance cepstral based pitch estimation performance in noisy environments.

Speech Production and Cepstral Pitch Determination

Voiced speech production can be modeled reasonably well as a pseudo pulse train (pitch sequence) convolved with a linear system (vocal tract impulse response). Speech is considered wide sense stationary over short time segments (20-40 msec) [1], which makes analysis possible over short time windows (frames of M samples). We assume that the z-domain description of the speech signal is modeled by [2], [3]

$$S(z) = H(z)\,P(z) \tag{1}$$

where $H(z)$ is the z-transform of the vocal tract response sequence and $P(z)$ is the z-transform of the pitch sequence. Analytical expressions for $H(z)$ and $P(z)$ may be found in [2] or [3].
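The source-filter model in (1) can be checked numerically on a single frame. The following is a minimal Python sketch, not the authors' implementation: the function name `synth_voiced_frame`, the sampling rate, the pitch, the frame length, and the single-resonance impulse response are all hypothetical values chosen for illustration. The convolution is computed circularly through the DFT, an assumption that makes the product relationship hold exactly on the frame.

```python
import numpy as np

def synth_voiced_frame(fs=8000, f0=100.0, n=320):
    """Toy voiced frame: a pulse train circularly convolved with a resonance.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    period = int(round(fs / f0))           # pitch period in samples
    p = np.zeros(n)
    p[::period] = 1.0                      # pitch sequence: one pulse per period
    t = np.arange(n) / fs
    # Hypothetical vocal tract response: one decaying 500 Hz resonance
    h = np.exp(-300.0 * t) * np.sin(2.0 * np.pi * 500.0 * t)
    # Circular convolution via the DFT, so S(k) = H(k) P(k) on this frame
    s = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(p)))
    return s, h, p, period

s, h, p, period = synth_voiced_frame()
# Multiplication in the transform domain matches convolution in time
print(np.allclose(np.fft.fft(s), np.fft.fft(h) * np.fft.fft(p)))
```

With the default values the frame spans 40 msec and contains four pitch periods of 80 samples each, which falls inside the short-time stationarity range quoted in the text.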
We may use homomorphic filtering techniques to separate the multiplicative relationship in (1) using the complex log operation, thereby causing the pitch cepstrum and the vocal tract response cepstrum to occupy approximately disjoint quefrency spaces [2], [4]. Practical implementations of cepstral pitch determination may be obtained from [4], in which it is shown that the inverse FFT of the log of the magnitude of the FFT provides us with the real version of the cepstrum. The connections between the complex cepstrum and the real cepstrum (usually denoted by just cepstrum) are shown in [2] and [3].

The Noise Problem

It is easy to see that homomorphic filtering (cepstral) techniques will not offer good performance in noise. Returning to (1) and taking the complex log operation, we find that

$$\log[S(z)] = \log[H(z)P(z)] = \log[H(z)] + \log[P(z)]. \tag{2}$$

The separation of $S(z)$ into its constituent parts works out very neatly assuming that no noise is added to the system. On the other hand, if noise is added to the system, we obtain

$$\log[S(z) + N(z)] = \log[H(z)P(z) + N(z)]. \tag{3}$$

A Cepstral Model for Speech Signals in Noise

Manipulating (3) yields a noisy cepstral signal model

$$\log[H(z)P(z) + N(z)] = \log[H(z)P(z)] + \log\left[1 + \frac{N(z)}{H(z)P(z)}\right] \tag{4}$$

which clearly exposes the desired signal component in the first term of the right-hand side. We shall find great utility in going to vector and matrix notation at this point following a discretization of equation (4). The appropriate discrete Fourier transform (DFT) equivalent of (4) is

$$\log[H(k)P(k) + N(k)] = \log[H(k)P(k)] + \log\left[1 + \frac{N(k)}{H(k)P(k)}\right] \tag{5}$$

where $k = 0, \ldots, M-1$ is the discrete normalized frequency variable. We shall also stay consistent with the notation found in [2] and [3] for representing the log of a general function, $X(k)$, as $\hat{X}(k)$. Thus, we represent (5) in vector form as

$$P = Z + \log\left[1 + D^{-1} n\right] \tag{6}$$

23ACSSC-12/89/0744 $1.00 © 1989 MAPLE PRESS
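Two steps described above can be sketched in Python: the real cepstrum as the inverse FFT of the log magnitude of the FFT, and the effect of additive noise obtained by factoring the signal term out of the log in (3). This is a hedged illustration, not the method of the paper or of [4]. The function names, the Hamming window, the 60-400 Hz search range, the `1e-12` floor, and the two-pulse test frame are assumptions, and log magnitudes are used in place of the complex log so that no phase unwrapping is needed.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude of the FFT."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Return fs / q, where q is the largest cepstral peak in the pitch range."""
    c = real_cepstrum(frame * np.hamming(len(frame)))
    qmin, qmax = int(fs / fmax), int(fs / fmin)   # quefrency bounds in samples
    q = qmin + int(np.argmax(c[qmin:qmax + 1]))
    return fs / q

def noise_term(HP, N):
    """Log-magnitude noise perturbation: log|1 + N(k) / (H(k) P(k))|."""
    return np.log(np.abs(1.0 + N / HP))

# Minimal two-pulse "pitch sequence" with an 80-sample period (100 Hz at 8 kHz)
fs, n, period = 8000, 320, 80
frame = np.zeros(n)
frame[0], frame[period] = 1.0, 0.5
print(cepstral_pitch(frame, fs))  # 100.0

# Additive noise enters the log spectrum only through the second term
rng = np.random.default_rng(0)
S = np.fft.fft(frame)
N = np.fft.fft(0.1 * rng.standard_normal(n))
lhs = np.log(np.abs(S + N))
rhs = np.log(np.abs(S)) + noise_term(S, N)
print(np.allclose(lhs, rhs))  # True
```

Because `noise_term` depends on the ratio of noise to signal at each frequency bin, it is small where the signal dominates and large in spectral valleys, which is one way to see why the log operation degrades at low signal to noise ratios.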

[1] Joseph Picone, et al. Spectrum estimation using an analytic signal representation, 1988.

[2] B. P. Bogert, et al. The quefrency analysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking, 1963.

[3] A. Oppenheim, et al. Homomorphic analysis of speech, 1968.

[4] C. Demeure, et al. The high-resolution spectrum estimator—A subjective entity, 1984, Proceedings of the IEEE.

[5] Werner Verhelst, et al. A new model for the short-time complex cepstrum of voiced speech, 1986, IEEE Trans. Acoust. Speech Signal Process.

[6] Thomas W. Parsons, et al. Voice and Speech Processing, 1986.

[7] A. Noll. Cepstrum pitch determination, 1967, The Journal of the Acoustical Society of America.

[8] George R. Doddington, et al. Robust pitch detection in a noisy telephone environment, 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.