Design considerations for optimizing the intelligibility of a DFT‐based, pitch‐excited, critical‐band‐spectrum speech analysis/resynthesis system

The primary objective of the research to be described was to determine whether a particular critical‐band spectral representation retains information sufficient for high‐performance speech recognition. The technique employed to answer this question was to design a speech analysis/resynthesis system (vocoder) in which the synthesizer used only a critical‐band spectral representation. The intelligibility of speech processed through this system was determined as a function of a number of design parameters. The magnitude spectrum was computed for overlapping windowed segments of the speech waveform every 10 ms. The critical‐band spectrum (38 spectral coefficients) was derived by forming the appropriate weighted sums of DFT magnitude coefficients. The analyzer also made a voicing decision and estimated fundamental frequency based on low‐frequency DFT peaks. During resynthesis, the full DFT magnitude spectrum was regenerated by interpolation between the 38 available coefficients. An inverse DFT, in which the phase was set to zero, yielded a finite impulse response that could be convolved with an idealized excitation source to reconstruct a speech waveform. Not surprisingly, the synthetic speech sounded muffled due to the width of the critical‐band spectral peaks. Several algorithms were then developed to sharpen these peaks prior to resynthesis. The best algorithm, to be described, was compared with a similar DFT‐magnitude vocoder without critical‐band smoothing and with a linear‐prediction vocoder, using a modified rhyme test. The results have implications for both speech recognition and vocoder design. [Work supported in part by an NIH grant.]
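
The analysis and resynthesis steps named above (weighted sums of DFT magnitudes, interpolation back to a full spectrum, zero-phase inverse DFT, convolution with an excitation) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system described in the abstract: the abstract gives the number of bands (38) but not the filter shapes, sampling rate, or DFT length, so the triangular weights, the Zwicker-Terhardt Bark approximation, the 10 kHz rate, the 129-bin spectrum, and every function name below are hypothetical choices. Peak sharpening, voicing, and pitch estimation are omitted.

```python
import bisect
import math


def bark(f_hz):
    # Zwicker-Terhardt approximation of the Bark scale (an assumed warping;
    # the abstract does not say how the critical bands were spaced).
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def make_bands(n_bins, sample_rate, n_bands=38):
    """Triangular, unit-sum weights with centers equally spaced in Bark."""
    bin_z = [bark(k * sample_rate / (2.0 * (n_bins - 1))) for k in range(n_bins)]
    step = bin_z[-1] / (n_bands + 1)
    centers = [(b + 1) * step for b in range(n_bands)]
    weights = []
    for c in centers:
        row = [max(0.0, 1.0 - abs(z - c) / step) for z in bin_z]
        total = sum(row)
        if total == 0.0:
            raise ValueError("band contains no DFT bins; use a longer DFT")
        weights.append([w / total for w in row])
    return bin_z, centers, weights


def analyze(mag, weights):
    """Critical-band spectrum: one weighted sum of DFT magnitudes per band."""
    return [sum(w * m for w, m in zip(row, mag)) for row in weights]


def interpolate(band_values, centers, bin_z):
    """Regenerate a full magnitude spectrum by linear interpolation in Bark."""
    out = []
    for z in bin_z:
        if z <= centers[0]:
            out.append(band_values[0])      # hold the first band below its center
        elif z >= centers[-1]:
            out.append(band_values[-1])     # hold the last band above its center
        else:
            i = bisect.bisect_right(centers, z)
            t = (z - centers[i - 1]) / (centers[i] - centers[i - 1])
            out.append((1.0 - t) * band_values[i - 1] + t * band_values[i])
    return out


def zero_phase_fir(mag):
    """Inverse DFT of a real, even magnitude spectrum with phase set to zero."""
    n = 2 * (len(mag) - 1)
    full = list(mag) + list(mag[-2:0:-1])   # mirror onto negative frequencies
    return [
        sum(full[k] * math.cos(2.0 * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def pulse_train(n_samples, period):
    """Idealized voiced excitation: one unit pulse per pitch period."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def convolve(x, h):
    """Direct-form convolution; skips zero input samples."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi != 0.0:
            for j, hj in enumerate(h):
                y[i + j] += xi * hj
    return y


SAMPLE_RATE = 10000   # Hz, assumed
N_BINS = 129          # non-negative-frequency bins of an assumed 256-point DFT

bin_z, centers, weights = make_bands(N_BINS, SAMPLE_RATE)
flat = [1.0] * N_BINS                       # a flat test spectrum
bands = analyze(flat, weights)              # 38 coefficients
rebuilt = interpolate(bands, centers, bin_z)
fir = zero_phase_fir(rebuilt)
frame = convolve(pulse_train(100, 50), fir)
```

Because the phase is zero, the impulse response is symmetric about n = 0 and wraps around the end of the buffer; a practical synthesizer would typically rotate it circularly by half its length to obtain a causal, linear-phase filter before convolving. The triangular weights also show why smoothing broadens spectral peaks: each band averages several neighboring DFT bins, which is the effect the sharpening algorithms mentioned above are meant to counteract.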