An attribute detection based approach to automatic speech processing

State-of-the-art automatic speech and speaker recognition systems are often built on a pattern-matching framework that achieves low recognition error rates on a variety of resource-rich tasks, provided that speech and text examples for building statistical acoustic and language models are plentiful and that the speaker, acoustic and language conditions follow a rigid protocol. However, because of their “black-box” top-down approach to knowledge integration, such systems cannot easily leverage the rich set of knowledge sources already available in the literature on speech, acoustics and languages. In this paper, we present a bottom-up approach to knowledge integration, called automatic speech attribute transcription (ASAT), which is intended to be “knowledge-rich”, so that new and existing knowledge sources can be verified and integrated into current spoken language systems to improve recognition accuracy and system robustness. Because the ASAT framework offers a “divide-and-conquer” strategy and a “plug-and-play” game plan, it should facilitate a cooperative speech processing community to which every researcher can contribute, with a view to improving speech processing capabilities that are currently not easily accessible to researchers in the speech science community.
