Consonant recognition with continuous-state hidden Markov models and perceptually-motivated features

Research into human perception of consonants has identified phoneme-specific perceptual cues. It has also been shown that the characteristics of the speech signal most useful for recognition depend on the specific speech sound. Typical ASR features and recognisers, however, neither vary with the type of sound nor relate directly to perceptual cues. We investigate classification and decoding of non-sonorant consonants using basic perceptually motivated features: phoneme durations and energy in a few broad spectral bands. Our classification results using simple classifiers suggest that the features optimal for human perception also perform best for machine classification. We show how characteristics of the learned models relate to knowledge of human speech perception. Recognition results using a continuous-state HMM (CSHMM) show accuracy similar to that of a discrete-state HMM with similar assumptions. We conclude by outlining how the CSHMM provides a mechanism for exploiting other perceptually important features through integration with similar models for the recognition of voiced sounds.
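As a rough illustration of the kind of feature vector the abstract describes (phoneme duration plus energy in a few broad spectral bands), the following minimal sketch computes such features for one pre-segmented phoneme. The band edges, the Hann window, the log compression, and the function name `band_log_energies` are all assumptions made for illustration; the abstract does not specify how the bands or energies are defined.

```python
import numpy as np

# Hypothetical band edges in Hz (assumption: the abstract does not give the actual bands).
BAND_EDGES_HZ = [(0, 1000), (1000, 3000), (3000, 8000)]

def band_log_energies(segment, sample_rate, bands=BAND_EDGES_HZ):
    """Return log energy in each band, followed by the segment duration in seconds."""
    segment = np.asarray(segment, dtype=float)
    # Taper the segment to reduce spectral leakage, then take the power spectrum.
    window = np.hanning(len(segment))
    power = np.abs(np.fft.rfft(segment * window)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    # Sum the power of all FFT bins falling inside each band [lo, hi).
    energies = [power[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
    # Small constant avoids log(0) for empty or silent bands.
    log_energies = np.log(np.asarray(energies) + 1e-12)
    duration = len(segment) / sample_rate
    return np.append(log_energies, duration)

# Usage: a 50 ms, 5 kHz tone sampled at 16 kHz stands in for a high-frequency fricative.
t = np.arange(800) / 16000
features = band_log_energies(np.sin(2 * np.pi * 5000 * t), 16000)
```

A vector like this could then be passed to a simple classifier; with these assumed bands, the tone above places most of its energy in the highest band.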
