Evaluating Features and Metrics for High-Quality Simulation of Early Vocal Learning of Vowels

How infants use auditory cues to learn to speak, despite the acoustic mismatch between their vocal apparatus and that of adults, is a topic of active scientific debate. Simulating early vocal learning with articulatory speech synthesis offers a way to gain a deeper understanding of this process. A crucial parameter in such simulations is the choice of features and of a metric for evaluating the acoustic error between the synthesised sound and the reference target. We contribute an evaluation of 40 feature-metric combinations on the task of optimising the production of static vowels with a high-quality articulatory synthesiser. To this end, we assess the usability of formant error and of the projection of the feature-metric error surface onto the normalised F1-F2 formant space. We show that this approach can be used to evaluate the impact of features and metrics and to offer insight into perceptual results.
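
As a concrete illustration of what one such feature-metric combination looks like, the sketch below compares a synthesised vowel against a reference target using mean MFCCs as the feature with Euclidean and cosine distances as metrics, and adds a rough LPC-based F1/F2 error. This is a minimal sketch, not the paper's implementation: the file names, MFCC settings, LPC order, and the formant-estimation helper are illustrative assumptions, and the paper's 40 combinations cover a wider range of features and metrics.

```python
# Minimal sketch of one feature-metric comparison between a synthesised
# vowel and a reference target. File names and parameter values are
# illustrative assumptions, not taken from the paper.
import numpy as np
import librosa
from scipy.spatial.distance import euclidean, cosine

def mfcc_features(y, sr, n_mfcc=13):
    # Mean MFCC vector over the (static) vowel segment.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def lpc_formants(y, sr, order=12, n_formants=2):
    # Rough F1/F2 estimate from LPC roots; assumes a clean, voiced vowel.
    a = librosa.lpc(y, order=order)
    roots = [r for r in np.roots(a) if np.imag(r) >= 0]
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    freqs = [f for f in freqs if 90 < f < sr / 2 - 50]  # drop near-DC/Nyquist roots
    return np.array(freqs[:n_formants])

# Hypothetical audio files: a reference vowel and the synthesiser's output.
ref, sr = librosa.load("reference_vowel.wav", sr=16000)
syn, _ = librosa.load("synthesised_vowel.wav", sr=16000)

f_ref, f_syn = mfcc_features(ref, sr), mfcc_features(syn, sr)
print("MFCC / Euclidean distance:", euclidean(f_ref, f_syn))
print("MFCC / cosine distance:   ", cosine(f_ref, f_syn))
print("Formant error (Hz):", np.abs(lpc_formants(ref, sr) - lpc_formants(syn, sr)))
```

In an optimisation loop, a score like these would be recomputed for every candidate articulatory configuration; projecting it over a grid of vowels in the normalised F1-F2 plane is what yields the error surfaces discussed above.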
