Spatio-temporal articulatory movement primitives during speech production: extraction, interpretation, and validation.

This paper presents a computational approach to derive interpretable movement primitives from speech articulation data. It puts forth a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given data matrix into a set of spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a cost function that trades off the mismatch between the proposed model and the input data against the number of primitives that are active at any given instant. The method is applied both to measured articulatory data obtained through electromagnetic articulography and to synthetic data generated using an articulatory synthesizer. The paper then describes how to evaluate the algorithm's performance quantitatively and further performs a qualitative assessment of the algorithm's ability to recover compositional structure from data. This is done using pseudo ground-truth primitives generated by the articulatory synthesizer based on an Articulatory Phonology framework [Browman and Goldstein (1995). "Dynamics and articulatory phonology," in Mind as Motion: Explorations in the Dynamics of Cognition, edited by R. F. Port and T. van Gelder (MIT Press, Cambridge, MA), pp. 175-194]. The results suggest that the proposed algorithm extracts movement primitives from human speech production data that are linguistically interpretable. Such a framework might aid the understanding of longstanding issues in speech production such as motor control and coarticulation.
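
To make the decomposition described above concrete, the following is a minimal sketch of a convolutive NMF model and a sparsity-penalized cost. It is not the authors' implementation. The function names (`shift_right`, `reconstruct`, `cost`), the array layout, the toy dimensions, and the weight `lam` are all illustrative assumptions. In particular, an L1 penalty on the activation matrix stands in here for the sparseness constraint; the abstract does not specify the exact form used in cNMFsc.

```python
import numpy as np

def shift_right(H, t):
    """Shift the columns of H right by t frames, zero-filling on the left."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def reconstruct(W, H):
    """Convolutive model: V_hat = sum over t of W[t] @ shift_right(H, t).

    W has shape (T, M, K): K basis sequences, each M variables by T frames.
    H has shape (K, N): one activation row per basis sequence.
    """
    T = W.shape[0]
    return sum(W[t] @ shift_right(H, t) for t in range(T))

def cost(V, W, H, lam=0.1):
    """Squared reconstruction error plus an L1 penalty on the activations.

    The L1 term is an assumed proxy for "few primitives active at once";
    lam is a hypothetical trade-off weight.
    """
    residual = V - reconstruct(W, H)
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(H))

# Toy data: 4 articulatory variables, 50 frames, 3 primitives of 5 frames each.
rng = np.random.default_rng(0)
M, N, K, T = 4, 50, 3, 5
W_true = rng.random((T, M, K))
H_true = rng.random((K, N)) * (rng.random((K, N)) < 0.1)  # sparse activations
V = reconstruct(W_true, H_true)
```

With the generating factors, the reconstruction error is zero and the cost reduces to the penalty term alone, so raising `lam` favors solutions in which fewer primitives are active per frame, at the expense of fit.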
