Resource Efficient Deep Eigenvector Beamforming

We propose binary neural networks (BNNs) for acoustic beamforming, which makes this speech enhancement approach resource efficient and suitable for embedded applications. Using CHiME4 data, we train BNNs to estimate the speech presence probability mask for GEV-PAN beamformers. The resulting audio quality and ASR scores are on par with those of single-precision deep neural networks (DNNs), while the computational requirements and the memory footprint are significantly reduced.
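To illustrate where the savings of a binary network can come from, the following is a minimal sketch and not the implementation evaluated here. It assumes weights and activations are constrained to +1 or -1 and stored one bit per value, so a dot product reduces to an XOR followed by a bit count. The helper names `binarize`, `pack_signs`, and `binary_dot` are hypothetical and chosen only for this example.

```python
def binarize(values):
    """Map real values to +1/-1 using the sign (0 is mapped to +1 by assumption)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_signs(signs):
    """Pack a sequence of +1/-1 values into one integer, one bit per value.

    A set bit encodes +1, a cleared bit encodes -1.
    """
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors given their packed bits.

    Positions where the bits differ contribute -1, all others +1,
    so the result is n - 2 * (number of differing bits).
    """
    disagreements = bin(a_bits ^ b_bits).count("1")
    return n - 2 * disagreements


# Toy values, purely illustrative.
weights = binarize([0.3, -1.2, 0.0, 0.7, -0.1])
activations = binarize([-0.5, -0.4, 2.0, 0.1, 0.9])
n = len(weights)

# Reference result using ordinary multiply-accumulate.
reference = sum(w * a for w, a in zip(weights, activations))

# Same result using only bitwise operations on the packed vectors.
fast = binary_dot(pack_signs(weights), pack_signs(activations), n)

print(reference, fast)
```

Under this one-bit-per-value assumption, each weight occupies 1 bit instead of the 32 bits of a single-precision float, and the multiply-accumulate is replaced by bitwise operations, which is the kind of reduction in memory and computation the abstract refers to.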
