Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool

Abstract The authors have created a complete text-to-speech system based on a tube resonance model of the vocal tract. The tube is controlled by a development of Carré’s “Distinctive Region Model”, which is in turn based on the formant-sensitivity findings of Fant and Pauli (1974). Achieving this required significant long-term linguistic research, including rhythm and intonation studies, and the development of low-level articulatory data and rules to drive the model, together with the necessary tools, parsers, dictionaries and so on. The tools and the current system are available under a General Public License and are described here. The paper gives further references, samples of the speech produced, and figures illustrating the system description.
