Exploring the Predictability of Non-Unique Acoustic-to-Articulatory Mappings

This paper explores statistical tools for analyzing predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography (EMA) database of simultaneously recorded acoustic and articulatory data. Because speech acoustics have been shown to map to non-unique articulatory modes, the variance of the articulatory parameters alone is not sufficient to characterize the predictability of the inverse mapping. We therefore estimate an upper bound on the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants, with respect to which articulators (lips, jaw, or tongue) are important for producing each phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy but can still have discrete non-unique configurations.
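To make the idea of an entropy upper bound concrete, the following is a minimal sketch, not the paper's actual procedure. It assumes the articulatory distribution for one acoustic vector has already been modeled as a Gaussian mixture with diagonal covariances, and it uses the standard bound H(X) <= H(Z) + H(X|Z), where Z is the mixture component label. The function name `gmm_entropy_upper_bound` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def gmm_entropy_upper_bound(weights, variances):
    """Upper bound (in nats) on the differential entropy of a Gaussian mixture.

    Assumption: diagonal covariances. `weights` holds the mixing weights
    (summing to 1) and `variances` holds, for each component, a list of
    per-dimension variances.

    Bound: H(X) <= sum_i w_i * (-log w_i + h_i), where h_i is the entropy
    of the i-th Gaussian component.
    """
    if len(weights) != len(variances):
        raise ValueError("need one variance list per mixture weight")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")

    bound = 0.0
    for w, var in zip(weights, variances):
        if w <= 0.0:
            continue  # a zero-weight component contributes nothing
        d = len(var)
        # Entropy of a d-dimensional Gaussian with diagonal covariance:
        # 0.5 * (d * log(2*pi*e) + log det(Sigma)), log det = sum of log variances.
        h = 0.5 * (d * math.log(2.0 * math.pi * math.e)
                   + sum(math.log(v) for v in var))
        bound += w * (h - math.log(w))
    return bound


# Hypothetical articulatory distributions for a single acoustic vector:
# one broad mode versus two narrow, equally likely modes.
one_broad_mode = gmm_entropy_upper_bound([1.0], [[4.0, 4.0]])
two_narrow_modes = gmm_entropy_upper_bound([0.5, 0.5], [[0.01, 0.01], [0.01, 0.01]])
```

Note that this bound depends only on the mixing weights and component variances, not on how far apart the component means lie. Two narrow, well-separated modes can therefore yield a lower bound than a single broad mode while still describing two distinct articulatory configurations, which is consistent with the abstract's closing observation. The bound becomes loose when components overlap heavily.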
