Exploiting speech production information for automatic speech and speaker modeling and recognition - possibilities and new opportunities

We consider the potential for incorporating speech production knowledge, whether directly measured or inferred, into speech technology development. We first review the technologies that can be used to capture speech articulation information. We then discuss how meaningful speech and speaker representations can be derived from the articulatory data thus captured and, further, how they can be estimated from the acoustics when direct measurements are unavailable. We present some applications that have used speech production information to advance the state of the art in automatic speech and speaker recognition. Finally, we offer an outlook on how such knowledge and applications can, in turn, inform scientific understanding of the human speech communication process.
