Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation

Time-domain training criteria have proven very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective. As the objective we propose a convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR) based loss. While this is a well-known evaluation metric (BSS Eval), it has not been used as a training objective before. To demonstrate its effectiveness, we evaluate the performance on LibriSpeech-based reverberant mixtures. On this task, the proposed system approaches the word error rate obtained on single-source non-reverberant input, i.e., LibriSpeech test-clean, with a gap of only 1.2 percentage points, thus outperforming both a conventional permutation invariant training based system and alternative objectives such as the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) by a large margin.
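
To make the criterion concrete, below is a minimal NumPy sketch of the CI-SDR computation, assuming the BSS Eval SDR definition: the reference is passed through a short FIR filter, chosen by least squares so that the filtered reference best explains the estimate, before the energy ratio is taken. The function name ci_sdr and the 512-tap filter length are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
# Minimal sketch of the CI-SDR metric following the BSS Eval SDR definition:
# project the estimate onto delayed copies of the reference (equivalently,
# filter the reference with a least-squares FIR filter) and measure the
# resulting energy ratio. Names and filter length are illustrative.
import numpy as np
from scipy.linalg import toeplitz


def ci_sdr(reference: np.ndarray, estimate: np.ndarray,
           filter_length: int = 512) -> float:
    """CI-SDR in dB between a clean reference and a separated estimate."""
    assert reference.ndim == 1 and reference.shape == estimate.shape
    # A[t, l] = reference[t - l]: each column is a delayed copy of the
    # reference, so A @ h is the reference filtered by the FIR filter h.
    first_row = np.zeros(filter_length)
    first_row[0] = reference[0]
    A = toeplitz(reference, first_row)            # shape (T, filter_length)
    # Ordinary least squares: filter h minimizing ||estimate - A @ h||^2.
    h, *_ = np.linalg.lstsq(A, estimate, rcond=None)
    target = A @ h                                # filtered reference
    distortion = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```

For training, the loss would be the negative CI-SDR so that gradient descent maximizes it, presumably combined across speakers with utterance-level permutation handling; in an actual system the least-squares solve would be expressed with the differentiable linear-algebra ops of the training framework rather than NumPy, so that gradients can flow back to the separation network.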

[1] B. D. Van Veen, et al. Beamforming: a versatile approach to spatial filtering, 1988, IEEE ASSP Magazine.

[2] Reinhold Haeb-Umbach, et al. SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition, 2019, arXiv.

[3] Zhong-Qiu Wang, et al. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Xiaofei Wang, et al. A Comparative Study on Transformer vs RNN in Speech Applications, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[5] Rémi Gribonval, et al. Performance measurement in blind audio source separation, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[6] Jont B. Allen, et al. Image method for efficiently simulating small-room acoustics, 1976, The Journal of the Acoustical Society of America.

[7] Jonathan Le Roux, et al. MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[8] Jonathan Le Roux, et al. SDR – Half-baked or Well Done?, 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Reinhold Haeb-Umbach, et al. BeamNet: End-to-end training of a beamformer-supported multi-channel ASR system, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Dong Yu, et al. Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks, 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11] Zhuo Chen, et al. Deep clustering: Discriminative embeddings for segmentation and separation, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Shinji Watanabe, et al. ESPnet: End-to-End Speech Processing Toolkit, 2018, INTERSPEECH.

[13] Pasi Pertilä, et al. Distant speech separation using predicted time-frequency masks from spatial features, 2015, Speech Communication.

[14] Reinhold Haeb-Umbach, et al. Optimizing neural-network supported acoustic beamforming by algorithmic differentiation, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Tomohiro Nakatani, et al. Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Reinhold Haeb-Umbach, et al. A generic neural acoustic beamforming architecture for robust multi-channel speech processing, 2017, Computer Speech & Language.

[18] Jonathan Le Roux, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Te-Won Lee, et al. Blind Source Separation Exploiting Higher-Order Frequency Dependencies, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[20] H. Sebastian Seung, et al. Learning the parts of objects by non-negative matrix factorization, 1999, Nature.

[21] Reinhold Haeb-Umbach, et al. Integration of Neural Networks and Probabilistic Spatial Models for Acoustic Blind Source Separation, 2019, IEEE Journal of Selected Topics in Signal Processing.

[22] Reinhold Haeb-Umbach, et al. Demystifying TasNet: A Dissecting Approach, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23] Pierre Comon, et al. Independent component analysis, A new concept?, 1994, Signal Processing.

[24] Andries P. Hekstra, et al. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[25] Jesper Jensen, et al. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[26] R. von Mises, et al. Praktische Verfahren der Gleichungsauflösung [Practical methods of solving equations], 1929, Zeitschrift für Angewandte Mathematik und Mechanik.

[27] Hiroshi Sawada, et al. Normalized observation vector clustering approach for sparse source separation, 2006, 2006 14th European Signal Processing Conference (EUSIPCO).

[28] Nima Mesgarani, et al. TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Jesper Jensen, et al. On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement, 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[30] Tara N. Sainath, et al. Performance of Mask Based Statistical Beamforming in a Smart Home Scenario, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31] Scott Rickard, et al. Blind separation of speech mixtures via time-frequency masking, 2004, IEEE Transactions on Signal Processing.