Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding

Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition and synthesis techniques, which allow segment-by-segment transmission of information such as phonemes and thereby reduce the bit rate. However, an encoder based on phoneme speech recognition may produce bursts of segmental errors that propagate into any suprasegmental (such as syllable) information coding. Together with voicing-detection errors in pitch parametrization, HMM-based speech coding therefore suffers from speech discontinuities and unnatural sound artifacts. In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The framework relies on a phonological (subphonetic) representation of speech and is designed as a composition of deep and spiking NNs: a bank of phonological analyzers at the transmitter and a phonological synthesizer at the receiver, both realized as deep NNs, together with a spiking NN that serves as an incremental and robust encoder of syllable boundaries for coding of continuous fundamental frequency (F0). A combination of phonological features defines many more sound patterns than the phonetic features used by HMM-based speech coders, and this finer analysis/synthesis code contributes to smoother encoded speech. Listeners significantly prefer the NN-based approach due to fewer discontinuities and artifacts in the encoded speech. Encoding and decoding each require only a single forward pass, and the proposed VLBR speech coder operates at a bit rate of approximately 360 bits/s.
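
To make the transmitter/receiver composition concrete, the following is a minimal PyTorch sketch of such an analysis/synthesis pipeline. The module names (PhonologicalAnalyzerBank, PhonologicalSynthesizer), the layer sizes, the number of phonological classes, the acoustic and vocoder dimensionalities, the 100 Hz frame rate, and the crude 1-bit quantization are illustrative assumptions for exposition; they are not the configuration or training recipe reported in the paper, and the spiking-NN syllable-boundary encoder is not modeled here.

```python
# Minimal sketch of a phonological analysis/synthesis coder (assumptions noted above).
import torch
import torch.nn as nn

NUM_PHONOLOGICAL_CLASSES = 13   # assumed size of the phonological feature set
NUM_CEPSTRAL_FEATURES = 39      # assumed acoustic front-end dimensionality
NUM_VOCODER_PARAMS = 40         # assumed synthesizer output dimensionality


class PhonologicalAnalyzerBank(nn.Module):
    """Transmitter: one small DNN per phonological class, each emitting a
    per-frame posterior probability of that class being active."""

    def __init__(self):
        super().__init__()
        self.analyzers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(NUM_CEPSTRAL_FEATURES, 256), nn.ReLU(),
                nn.Linear(256, 1), nn.Sigmoid(),
            )
            for _ in range(NUM_PHONOLOGICAL_CLASSES)
        ])

    def forward(self, acoustic_frames):
        # acoustic_frames: (num_frames, NUM_CEPSTRAL_FEATURES)
        posteriors = [analyzer(acoustic_frames) for analyzer in self.analyzers]
        return torch.cat(posteriors, dim=-1)  # (num_frames, num_classes)


class PhonologicalSynthesizer(nn.Module):
    """Receiver: maps the (quantized) phonological posteriors plus an F0 track
    back to vocoder parameters in a single forward pass."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_PHONOLOGICAL_CLASSES + 1, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, NUM_VOCODER_PARAMS),
        )

    def forward(self, posteriors, f0):
        return self.net(torch.cat([posteriors, f0], dim=-1))


if __name__ == "__main__":
    frames = torch.randn(100, NUM_CEPSTRAL_FEATURES)  # 1 s at an assumed 100 Hz frame rate
    f0 = torch.rand(100, 1)                           # continuous F0 track, normalized
    encoder, decoder = PhonologicalAnalyzerBank(), PhonologicalSynthesizer()
    code = (encoder(frames) > 0.5).float()            # crude 1-bit-per-class quantization
    vocoder_params = decoder(code, f0)
    print(vocoder_params.shape)                       # torch.Size([100, 40])
```

In this sketch the encoded channel payload would be the binary phonological code plus the F0 information; the actual bit allocation that yields roughly 360 bits/s is described in the paper itself.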
