Multi-frame Quantization of LSF Parameters Using a Deep Autoencoder and Pyramid Vector Quantizer

This paper presents a multi-frame quantization scheme for line spectral frequency (LSF) parameters using a deep autoencoder (DAE) and a pyramid vector quantizer (PVQ). The objective is to provide sophisticated LSF quantization for ultra-low-bit-rate speech coders with moderate delay. To compress and de-correlate multiple LSF frames, a DAE with linear coder-layer units and Gaussian noise is used. The DAE shows a high degree of modelling flexibility for multiple LSF frames. A PVQ is then used to quantize the coder-layer vector effectively. Compared with the discrete cosine model (DCM), DAE-based compression models multi-frame LSF parameters more accurately and has the further advantage that the coder-layer dimension can take any value. The coder-layer dimension of the DAE governs the trade-off between the modelling distortion and the coder-layer quantization distortion. Experimental results show that the proposed algorithm, once the optimal coder-layer dimension has been determined, outperforms the DCM-based multi-frame LSF quantization approach in terms of spectral distortion (SD) performance and robustness across different speech segments.
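To make the PVQ step concrete, the following is a minimal sketch of pyramid vector quantization in general, not the implementation used in this paper. It assumes a greedy largest-remainder search for the lattice point, and the names `pvq_quantize` and `pvq_codebook_size` are hypothetical. A PVQ codebook consists of all integer vectors of dimension L whose absolute values sum to K pulses; its size N(L, K) follows the standard recursion N(L, K) = N(L-1, K) + N(L-1, K-1) + N(L, K-1).

```python
import math
from functools import lru_cache


def pvq_quantize(x, k):
    """Greedy PVQ search (illustrative): return integers y with sum(|y_i|) == k."""
    l1 = sum(abs(v) for v in x)
    if l1 == 0:
        # Degenerate input: place all pulses on the first component.
        return [k] + [0] * (len(x) - 1)
    # Scale the magnitudes onto the pyramid surface, then round down.
    scaled = [abs(v) * k / l1 for v in x]
    pulses = [int(s) for s in scaled]  # int() floors because s >= 0
    # Give the leftover pulses to the components with the largest remainders.
    remaining = k - sum(pulses)
    order = sorted(range(len(x)),
                   key=lambda i: scaled[i] - pulses[i],
                   reverse=True)
    for i in order[:remaining]:
        pulses[i] += 1
    # Restore the signs of the input vector.
    return [p if v >= 0 else -p for p, v in zip(pulses, x)]


@lru_cache(maxsize=None)
def pvq_codebook_size(dim, k):
    """Number of integer points with the given dimension and L1 norm k."""
    if k == 0:
        return 1
    if dim == 0:
        return 0
    return (pvq_codebook_size(dim - 1, k)
            + pvq_codebook_size(dim - 1, k - 1)
            + pvq_codebook_size(dim, k - 1))


if __name__ == "__main__":
    coder_vector = [0.5, -0.3, 0.2]  # stand-in for a DAE coder-layer vector
    y = pvq_quantize(coder_vector, 4)
    bits = math.ceil(math.log2(pvq_codebook_size(len(coder_vector), 4)))
    print(y, bits)
```

Because the codebook size grows with both the dimension and the pulse count, a fixed bit budget leaves fewer pulses per component as the coder-layer dimension increases, which is one way to picture the trade-off between modelling distortion and quantization distortion described above.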
