Delayed decisions in speech recognition - The case of formants

Abstract Consciously designed message-bearing signals are generally well suited to bottom-up decoding: the signal is first segmented into low-level units, the units are classified, and the higher-level information is then deduced. Evidence from the reading of printed text suggests that humans may not use such a strategy even on signals apparently well suited to it, and that spontaneous modes of communication - handwriting and speech - are quite unsuited to the strategy. It is argued that the most effective algorithms for automatic speech recognition derive their effectiveness from an ability to delay low-level decisions (such as segmental identities and boundaries) until higher-level decisions (such as word identities) have been made. A case is made for representing speech for recognition purposes in terms of the frequencies of the vocal-tract resonances (formants). The fact that formant frequencies have not hitherto been widely used in speech recognition is ascribed to their resembling other low-level features in that they too cannot be reliably extracted and labeled before some higher-level decisions have been made. An algorithm is presented that allows decisions on formant identities to be delayed and made contingent on higher-level decisions. A technique is described for deriving the cost functions used in this algorithm from the statistical properties of formants, and some practical applications of the algorithm are briefly described.
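
The abstract does not spell out the algorithm itself, but the delayed-decision idea it describes can be illustrated with a small sketch. The Python code below is not the paper's method: the peak values, the Gaussian formant statistics, and the k-best strategy are all illustrative assumptions. It scores every admissible labelling of one frame's spectral peaks as (F1, F2, F3) using cost functions derived from assumed statistical properties of the formants, and keeps the k cheapest labellings instead of committing to one, so that a higher-level stage (e.g. word-level matching) could make the final choice later.

    # Minimal sketch of delayed-decision formant labelling (illustrative only).
    # Rather than committing each frame's spectral peaks to formant labels
    # F1/F2/F3 immediately, score every admissible assignment with simple
    # Gaussian cost functions and keep the k best hypotheses, leaving the
    # final choice to a higher-level decision stage.
    # All constants below are invented for the example, not from the paper.

    import itertools
    import math

    # Assumed per-formant statistics (Hz): rough (mean, spread) for F1-F3.
    FORMANT_STATS = [(500.0, 200.0), (1500.0, 400.0), (2500.0, 500.0)]

    def assignment_cost(peaks):
        """Negative log-likelihood (up to a constant) of labelling peaks[i]
        as formant i+1 under independent Gaussian models. Lower is better."""
        return sum(0.5 * ((f - mu) / sd) ** 2 + math.log(sd)
                   for f, (mu, sd) in zip(peaks, FORMANT_STATS))

    def hypotheses(candidate_peaks, k=3):
        """Enumerate every ascending choice of 3 peaks as a candidate
        (F1, F2, F3) labelling and return the k cheapest, unresolved."""
        labellings = [(assignment_cost(combo), combo)
                      for combo in itertools.combinations(sorted(candidate_peaks), 3)]
        labellings.sort()
        return labellings[:k]

    # Example frame: four spectral peaks, of which 1100 Hz is ambiguous
    # between F1 and F2. Several labellings are kept instead of deciding now.
    peaks = [450.0, 1100.0, 1700.0, 2600.0]
    for cost, (f1, f2, f3) in hypotheses(peaks):
        print(f"cost={cost:7.2f}  F1={f1:.0f}  F2={f2:.0f}  F3={f3:.0f}")

In a full recognizer, the retained hypotheses per frame would be linked across frames (e.g. by dynamic programming with a continuity cost), and the labelling would only be resolved when the higher-level match makes one hypothesis clearly preferable, which is the delaying of low-level decisions the abstract argues for.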
