Speech vocoding for laboratory phonology

Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing and, more broadly, between the abstract and physical structures of a speech signal. Our goal is to take a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems, and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP, the most compact phonological speech representation, performs comparably to the systems with a larger number of phonological features. The parametric TTS system based on the phonological speech representation, trained on an unlabelled audiobook in an unsupervised manner, reaches 85% of the intelligibility of state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. Laboratory phonologists might test the applied concepts of their theoretical models, while the speech processing community may draw on the concepts developed for theoretical phonological models to improve current state-of-the-art applications.
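To make the analysis-synthesis idea behind phonological vocoding concrete, the sketch below shows a minimal round trip: an analyser network maps acoustic frames to posterior probabilities of binary phonological features, and a synthesiser network maps those posteriors back to acoustic parameters for a waveform generator. This is an illustrative sketch only; the class names, feed-forward PyTorch architecture, layer sizes, and feature count are assumptions for exposition and do not reproduce the trained networks or parametric vocoder configuration used in the paper.

```python
# Minimal sketch of an analysis-synthesis phonological vocoder (illustrative only).
import torch
import torch.nn as nn

N_ACOUSTIC = 39        # e.g. MFCCs with deltas per frame (assumed dimensionality)
N_PHON_FEATURES = 14   # size of a binary phonological feature set (assumed)

class PhonologicalAnalyser(nn.Module):
    """Maps acoustic frames to posterior probabilities of phonological features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_ACOUSTIC, 256), nn.ReLU(),
            nn.Linear(256, N_PHON_FEATURES), nn.Sigmoid(),  # one detector per feature
        )

    def forward(self, acoustic):
        return self.net(acoustic)

class PhonologicalSynthesiser(nn.Module):
    """Maps phonological feature posteriors back to acoustic parameters."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_PHON_FEATURES, 256), nn.ReLU(),
            nn.Linear(256, N_ACOUSTIC),
        )

    def forward(self, posteriors):
        return self.net(posteriors)

# Round trip over 100 frames of dummy acoustic features.
frames = torch.randn(100, N_ACOUSTIC)
analyser, synthesiser = PhonologicalAnalyser(), PhonologicalSynthesiser()
posteriors = analyser(frames)            # compact phonological representation
reconstructed = synthesiser(posteriors)  # parameters for a waveform generator
```

Swapping the feature inventory (GP, SPE, or eSPE) amounts to changing the output layer of the analyser and the input layer of the synthesiser, which is what makes side-by-side comparison of phonological systems straightforward in this framework.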
