On compressibility of neural network phonological features for low bit rate speech coding

Phonological features extracted by neural networks have shown potential for low bit rate speech vocoding. The span of phonological features is wider than the span of phonetic features, and thus fewer frames need to be transmitted. Moreover, the binary nature of phonological features enables a higher compression ratio at a minor cost in quality. In this paper, we study the compressibility and structured sparsity of the phonological features. We propose a compressive sampling framework for speech coding, with sparse reconstruction for decoding prior to synthesis. Compressive sampling is found to be a principled approach to compression, in contrast to the conventional pruning approach; it leads to a 50% reduction in bit rate with equal or better quality of the decoded speech. Furthermore, exploiting the structured sparsity and binary characteristic of these features is shown to enable very low bit rate coding at 700 bps with negligible quality loss; this coding scheme imposes no latency. If we allow a latency of 256 ms for supra-segmental structures, a rate of 250-350 bps is achieved.
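To make the encode/decode idea concrete, the following is a minimal sketch of compressive sampling and sparse reconstruction applied to a sparse binary vector. It is an illustration under stated assumptions, not the method of the paper: the vector sizes (`n`, `m`, `k`), the Gaussian measurement matrix, and the use of orthogonal matching pursuit as the decoder are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def omp(A, y, k):
    """Greedy orthogonal matching pursuit: estimate a k-sparse x from y = A @ x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual,
        # skipping columns that were already selected.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat


# Hypothetical sizes: n features per block, m < n measurements, k active features.
n, m, k = 64, 32, 4
rng = np.random.default_rng(0)

# Sparse binary feature vector (stand-in for binarized phonological features).
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = 1.0

# Encoder: random linear measurements, m values sent instead of n.
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x

# Decoder: sparse reconstruction, then rounding back to binary values.
x_hat = omp(A, y, k)
x_bin = (x_hat > 0.5).astype(float)

print("measurements sent:", m, "of", n)
print("positions recovered correctly:", int(np.sum(x_bin == x)), "of", n)
```

The decoder relies only on the assumption that few features are active at once; a decoder that also exploits the structured sparsity and binary values mentioned in the abstract would replace the generic greedy step used here.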
