Paper Information: Tensor-To-Vector Regression for Multi-Channel Speech Enhancement Based on Tensor-Train Network

We propose a tensor-to-vector regression approach to multi-channel speech enhancement that addresses the explosion in input size and the resulting expansion of hidden-layer size. The key idea is to cast the conventional deep neural network (DNN) based vector-to-vector regression formulation in a tensor-train network (TTN) framework. TTN is a recently introduced compact representation for deep models with fully connected hidden layers; it maintains the DNN's expressive power with far fewer trainable parameters. Furthermore, TTN handles a multi-dimensional tensor input by design, which matches the setting of multi-channel speech enhancement. We first provide a theoretical extension from DNN to TTN based regression. Next, we show that TTN can attain speech enhancement quality comparable to that of a DNN with far fewer parameters, e.g., a reduction from 27 million to only 5 million parameters is observed in a single-channel scenario. When its number of trainable parameters is slightly increased, TTN also improves PESQ over the DNN from 2.86 to 2.96. Finally, in 8-channel conditions, TTN achieves a PESQ of 3.12 with 20 million parameters, whereas a DNN with 68 million parameters attains only 3.06.
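To make the parameter savings concrete, the following is a minimal sketch of how the weight count of one fully connected layer compares with that of a tensor-train layer. It assumes the commonly used TT-matrix layout, in which core k has shape (r_{k-1}, m_k, n_k, r_k) with boundary ranks equal to 1. The mode sizes and ranks below are hypothetical and chosen only for illustration; they are not the configurations used in the paper, and the function names are invented for this example.

```python
from math import prod


def dense_params(in_modes, out_modes):
    """Weights in an ordinary fully connected layer (bias ignored)."""
    return prod(in_modes) * prod(out_modes)


def tt_params(in_modes, out_modes, ranks):
    """Weights in a tensor-train layer (bias ignored).

    Assumed layout: core k has shape
    (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]).
    """
    if not (len(in_modes) == len(out_modes) == len(ranks) - 1):
        raise ValueError("need one core per mode and len(ranks) == modes + 1")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


# Hypothetical sizes: 2048 inputs and 2048 outputs, each viewed as an
# 8 x 16 x 16 tensor, with all interior TT ranks set to 4.
in_modes = (8, 16, 16)
out_modes = (8, 16, 16)
ranks = (1, 4, 4, 1)

dense = dense_params(in_modes, out_modes)
tt = tt_params(in_modes, out_modes, ranks)
print(f"dense layer weights: {dense}")
print(f"TT layer weights:    {tt}")
print(f"compression ratio:   {dense / tt:.1f}x")
```

The dense count grows with the product of all mode sizes, while the TT count grows with their sum weighted by the ranks, which is why a multi-channel input reshaped into a tensor does not blow up the layer size. Larger ranks trade some of that compression for more capacity.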
