A unified approach to speech production and recognition based on articulatory motor representations

We present a unified approach to speech production and recognition based on articulatory motor representations. The approach is inspired by the motor theory of speech perception and the discovery of mirror neurons, and uses motor representations both to reproduce and to recognize speech. A model of the vocal tract is used to create sound, and the created sound is then mapped back to the motor representation by a neural network. To learn this map we mimic the behavior of a child, which uses a combination of babbling and interaction with its caregiver to learn how to speak. Several distinct phases of babbling and interaction are identified and described; these help to overcome the inversion problem. The approach has been implemented on a humanoid robot, which has successfully learned to pronounce Swedish and Portuguese vowels. We have also studied how the different phases of babbling and interaction affect the error of the map and the recognition rate achieved when the robot is presented with vowels from different speakers. Finally, we compare the recognition rates obtained in motor space with those obtained directly from the acoustic parameters.
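The abstract describes the architecture only at a high level, so the following Python sketch is a loose illustration of the babbling idea, not the authors' implementation. A stand-in synthesize function plays the role of the articulatory vocal-tract model, random motor babbling generates paired (acoustic, motor) training data, and a small neural network learns the inverse acoustic-to-motor map that enables recognition in motor space. All names, dimensions, and parameters here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the articulatory synthesizer (the paper uses a vocal tract
# model; this fixed nonlinear map is only a placeholder). It turns motor
# parameters into formant-like acoustic features.
def synthesize(motor):
    w = np.array([[0.8, -0.3], [0.2, 0.9], [-0.5, 0.4]])
    return np.tanh(motor @ w.T)

# Babbling phase: issue random motor commands and listen to the result,
# yielding paired (acoustic, motor) training data.
n, motor_dim = 2000, 2
motor = rng.uniform(-1.0, 1.0, size=(n, motor_dim))
acoustic = synthesize(motor)

# One-hidden-layer network trained by gradient descent to invert the
# synthesizer: acoustic features in, motor parameters out.
hidden = 16
W1 = rng.normal(0.0, 0.5, (acoustic.shape[1], hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.5, (hidden, motor_dim))
b2 = np.zeros(motor_dim)
lr = 0.1

for epoch in range(5000):
    h = np.tanh(acoustic @ W1 + b1)        # forward pass
    pred = h @ W2 + b2
    err = pred - motor                     # gradient of 0.5 * squared error
    gW2 = h.T @ err / n
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h**2)       # backpropagate through tanh
    gW1 = acoustic.T @ dh / n
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Recognition in motor space: map a heard sound back to the motor
# parameters that would have produced it; classification would then
# compare these recovered parameters against stored vowel targets.
heard = synthesize(np.array([[0.3, -0.7]]))
recovered = np.tanh(heard @ W1 + b1) @ W2 + b2
print("true motor:      [0.3, -0.7]")
print("recovered motor:", recovered[0])

In the paper the babbling is structured into distinct phases and refined through caregiver interaction to resolve the inversion problem; in this sketch a single pass of uniform random babbling stands in for that whole process.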
