The Use of Phonetic Motor Invariants Can Improve Automatic Phoneme Discrimination

We investigate the use of phonetic motor invariants (MIs), that is, recurring kinematic patterns of the human phonetic articulators, to improve automatic phoneme discrimination. Using a multi-subject database of synchronized speech and lip/tongue trajectories, we first identify MIs commonly associated with bilabial and dental consonants and use them to simultaneously segment the speech and motor signals. We then build a simple neural-network-based regression scheme (the Audio-Motor Map, AMM) that maps the audio features of these segments to the corresponding MIs. Extensive experiments show that a small set of features extracted from the MIs, as originally gathered from articulatory sensors, is dramatically more effective than a large, state-of-the-art set of audio features at discriminating bilabials from dentals; that the same features, extracted from AMM-reconstructed MIs, are as effective as or better than the audio features when testing across speakers and coarticulating phonemes; and that they become dramatically better as noise is added to the speech signal. These results seem to support some of the claims of the motor theory of speech perception and provide experimental evidence of the practical usefulness of MIs in the broader framework of automatic speech recognition.

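The abstract describes a two-stage pipeline: a neural-network regression (the Audio-Motor Map) that reconstructs motor-invariant features from audio features, followed by a classifier that discriminates bilabials from dentals using either measured or AMM-reconstructed MIs. The sketch below illustrates that pipeline on synthetic data, using scikit-learn's MLPRegressor as a stand-in for the paper's neural network and an SVM as the phoneme discriminator; the feature dimensions, variable names, and data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an Audio-Motor-Map-style pipeline, assuming:
#  - per-segment audio feature vectors (e.g. MFCC-like) as regressor input,
#  - low-dimensional motor-invariant (MI) features as regression targets,
#  - a binary bilabial-vs-dental discrimination on top of the MIs.
# All dimensions and data below are synthetic placeholders, not the paper's setup.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC

rng = np.random.default_rng(0)

n_segments, n_audio, n_mi = 400, 39, 4              # hypothetical sizes
X_audio = rng.normal(size=(n_segments, n_audio))     # audio features per segment
W = rng.normal(size=(n_audio, n_mi))
Y_mi = np.tanh(X_audio @ W) + 0.1 * rng.normal(size=(n_segments, n_mi))  # fake MI targets
y_class = (Y_mi[:, 0] > 0).astype(int)               # fake bilabial(1) / dental(0) labels

train, test = slice(0, 300), slice(300, None)

# 1) Audio-Motor Map: regress MI features from audio features.
amm = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
amm.fit(X_audio[train], Y_mi[train])

# 2) Train a bilabial/dental discriminator on MI-derived features,
#    then evaluate it on MIs reconstructed by the AMM for unseen segments.
clf = SVC(kernel="rbf").fit(amm.predict(X_audio[train]), y_class[train])
Y_hat = amm.predict(X_audio[test])
print("accuracy on AMM-reconstructed MIs:", clf.score(Y_hat, y_class[test]))
```

In the paper, the key comparison is between features computed from sensor-measured MIs, features from AMM-reconstructed MIs, and purely acoustic features; this toy example only shows the regress-then-classify structure, not that comparison.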