Supervised source enhancement composed of nonnegative auto-encoders and complementarity subtraction

A method for constructing deep neural networks (DNNs) for accurate supervised source enhancement is proposed. Previous studies used DNNs to estimate the power spectral densities (PSDs) of sound sources from multiple beamforming outputs; these PSDs are then used to design Wiener filters for source enhancement. Although this improved performance, accurate PSD estimation could not be guaranteed because the trained DNNs were treated as black boxes. The proposed construction method uses non-negative auto-encoders and complementarity subtraction. This study shows that auto-encoders with non-negative weights correspond to non-negative matrix factorization (NMF), which decomposes source PSDs into non-negative spectral bases and their activations. It further introduces a complementarity subtraction method for estimating PSDs accurately. Experiments confirmed that the signal-to-interference-plus-noise ratio (SINR) improved by approximately 12 dB on datasets recorded in various noisy and reverberant rooms.
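The abstract builds on two standard ingredients: a Wiener filter computed bin by bin from source PSDs, and NMF, which factorizes a non-negative power spectrogram into spectral bases and activations. The following NumPy sketch illustrates both under stated assumptions and is not the authors' DNN. NMF is fitted here with the standard Lee-Seung multiplicative updates for the Euclidean cost; in one natural reading of the auto-encoder correspondence, the decoder weights would play the role of `W` and the hidden-layer outputs the role of `H`. The names `nmf` and `wiener_gain`, the random toy PSDs, and the use of oracle PSDs for the gain are all hypothetical choices made for illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Factorize a non-negative matrix V (freq x frames) as V ~ W @ H.

    Standard Lee-Seung multiplicative updates for the Euclidean cost.
    W holds non-negative spectral bases, H their non-negative activations.
    (Assumption: in an auto-encoder view, W ~ decoder weights, H ~ hidden outputs.)
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def wiener_gain(psd_target, psd_interference, eps=1e-12):
    """Per-bin Wiener gain phi_s / (phi_s + phi_n), which lies in [0, 1]."""
    return psd_target / (psd_target + psd_interference + eps)


# --- Toy demo (hypothetical data, not from the paper) ---
rng = np.random.default_rng(1)
F, T = 64, 100
# Target and interference PSDs, each built from 2 random non-negative bases.
psd_s = rng.random((F, 2)) @ rng.random((2, T))
psd_n = rng.random((F, 2)) @ rng.random((2, T))
# Observed power spectrogram, assuming the source PSDs add.
V = psd_s + psd_n

W, H = nmf(V, rank=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative NMF reconstruction error: {rel_err:.3f}")

# Oracle gain for illustration; the proposed method would supply PSD estimates.
G = wiener_gain(psd_s, psd_n)
```

In practice the gain `G` multiplies the mixture's STFT coefficients in each time-frequency bin, so any error in the PSD estimates feeds directly into the filter; this is why the accuracy of PSD estimation matters for the enhancement result.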
