Articulatory Synthesis of Speech and Singing: State of the Art and Suggestions for Future Research

Articulatory synthesis of speech and singing aims to model the production process of speech and singing in as human-like, or natural, a manner as possible. The state of the art is described for all modules of articulatory synthesis systems, i.e. vocal tract models, acoustic models, glottis models, noise source models, and control models that generate articulator movements and phonatory control information. While a lot of knowledge is available on the production and the high-quality acoustic realization of static spoken and sung sounds, it is suggested that the quality of control models be improved, especially for the generation of articulatory movements. Thus the main problem to be addressed in improving articulatory synthesis over the next years is the development of high-quality control concepts. It is suggested to use action-based control concepts and to gather control knowledge by imitating natural speech acquisition and singing acquisition scenarios. It is emphasized that teacher-learner interaction and the production, perception, and comprehension of auditory as well as visual and somatosensory information (multimodal information) should be included in the acquisition (i.e. training or learning) procedures.
