Human and automatic speech recognition in the presence of speech-intrinsic variations

The aim of this dissertation is the analysis and improvement of automatic speech recognition (ASR). Because the human auditory system is far superior to current ASR systems, the recognition performance of humans and machines was compared first. From the specific differences, conclusions were drawn about signal processing mechanisms that lead to improvements in ASR. The comparison focused in particular on the influence of intrinsic variability (changes in speaking rate, speaking effort, and speaking style, as well as dialect and accent). The results show that the processing of temporal cues in ASR offers potential for optimization. Spectro-temporal features were therefore employed for ASR; under altered speaking effort and varying speaking style, they yielded an improvement over standard features, which demonstrates the usefulness of spectro-temporal and temporal information for automatic recognizers.
