Auditory VOCODER: Speech resynthesis from an auditory Mellin representation

We assume that speech morphing, noise suppression, and speech segregation would improve if they were based more closely on human perception. Accordingly, we developed an Auditory VOCODER that resynthesizes speech from an auditory Mellin representation, a representation used to explain human perception. The Auditory VOCODER has three modules: an Auditory Mellin Image model [9,10], a STRAIGHT VOCODER [2], and a mapping module that combines warped-frequency cepstral analysis with nonlinear multivariate regression analysis (MRA). We describe the modules and an evaluation of the system. Informal listening indicates that the sound quality is reasonable.
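
As a rough illustration of the warped-frequency cepstral analysis named in the mapping module, the following sketch resamples a log-magnitude spectrum onto a warped frequency axis and takes its real cepstrum. It is a minimal sketch under stated assumptions, not the system's implementation: the abstract does not specify the warping function, so a first-order all-pass warping is assumed here, and the function names, the value `alpha=0.35`, and the array sizes are all hypothetical.

```python
import numpy as np


def allpass_warp(omega, alpha):
    """Phase response of a first-order all-pass filter (assumed warping).

    Maps angular frequency in [0, pi] monotonically onto [0, pi] for |alpha| < 1.
    """
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )


def warped_cepstrum(log_spectrum, n_coef, alpha=0.35):
    """Real cepstrum of a log spectrum resampled on a warped frequency axis.

    log_spectrum: log-magnitude values on a uniform grid over [0, pi].
    n_coef: number of low-order cepstral coefficients to return.
    alpha: warping coefficient (hypothetical default, not from the source).
    """
    log_spectrum = np.asarray(log_spectrum, dtype=float)
    n_bins = len(log_spectrum)
    omega = np.linspace(0.0, np.pi, n_bins)
    warped = allpass_warp(omega, alpha)
    # Sample the spectrum at points that are uniform on the warped axis.
    resampled = np.interp(omega, warped, log_spectrum)
    # Mirror to an even-symmetric sequence so the inverse FFT is real.
    full = np.concatenate([resampled, resampled[-2:0:-1]])
    cepstrum = np.fft.ifft(full).real
    return cepstrum[:n_coef]
```

With `alpha=0` the warping is the identity and the function reduces to an ordinary real cepstrum. In a mapping module of the kind described, low-order coefficients such as these could serve as regression targets or predictors for the MRA stage, though the abstract does not say how the two steps are connected.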

[1]  Toshio Irino, et al.  An analysis/synthesis auditory filterbank based on an IIR implementation of the gammachirp, 1999.

[2]  E. Lopez-Poveda, et al.  A computational algorithm for computing nonlinear auditory frequency selectivity, 2001, The Journal of the Acoustical Society of America.

[3]  Stan Davis, et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se, 1980.

[4]  T. Ozaki, et al.  Modelling nonlinear random vibrations using an amplitude-dependent autoregressive time series model, 1981.

[5]  Roy D. Patterson, et al.  Sound resynthesis from Auditory Mellin Image using STRAIGHT, 2001.

[6]  R. Patterson, et al.  Time-domain modeling of peripheral auditory processing: a modular architecture and a software platform, 1995, The Journal of the Acoustical Society of America.

[7]  Malcolm Slaney.  Pattern playback from 1950 to 1995, 1995, 1995 IEEE International Conference on Systems, Man and Cybernetics. Intelligent Systems for the 21st Century.

[8]  R. Patterson, et al.  A pulse ribbon model of monaural phase perception, 1987, The Journal of the Acoustical Society of America.

[9]  T. Irino, et al.  A time-domain, level-dependent auditory filter: The gammachirp, 1997.

[10]  Roy D. Patterson, et al.  Stabilised wavelet Mellin transform: an auditory strategy for normalising sound-source size, 1999, EUROSPEECH.

[11]  Satoshi Imai, et al.  Cepstral analysis synthesis on the mel frequency scale, 1983, ICASSP.

[12]  Roy D. Patterson, et al.  Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform, 2002, Speech Commun.

[13]  H. Strube.  Linear prediction on a warped frequency scale, 1980.

[14]  Hideki Kawahara, et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, 1999, Speech Commun.