Online Audio-Visual Speech Separation with Generative Adversarial Training

Audio-visual speech separation has been demonstrated to be effective in solving the cocktail party problem. However, most existing models cannot run online, which limits their application in video communication and human-robot interaction. Moreover, the scale-invariant signal-to-noise ratio (SI-SNR), the most popular training loss function in speech separation, introduces artifacts into the separated audio that can harm downstream applications such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to address both problems. We build the generator (i.e., the audio-visual speech separator) from causal temporal convolutional network (TCN) blocks and propose a streaming inference strategy, which allows the model to perform speech separation in an online manner. A discriminator participates in optimizing the generator, which reduces the negative effects of the SI-SNR loss. Experiments on simulated 2-speaker mixtures built from the challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and the audio-visual model advr-AVSS at the same model size. We measure the running time of our model on GPU and CPU, and the results show that it meets the requirements of online processing. The demo and code can be found at https://github.com/aispeech-lab/oavss.
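For context, the SI-SNR loss discussed above is conventionally defined as follows; this is the standard definition from the Conv-TasNet literature, with training maximizing SI-SNR (i.e., minimizing its negative):

```latex
% Standard SI-SNR for an estimate \hat{s} and target s (both zero-mean):
s_{\text{target}} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^2},
\qquad
e_{\text{noise}} = \hat{s} - s_{\text{target}},
\qquad
\text{SI-SNR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2}.
```

Because SI-SNR is scale-invariant and purely energy-based, estimates that score well can still carry audible artifacts, which motivates adding the adversarial term.

The causal TCN building block and the streaming idea can be sketched as below. This is a minimal illustration assuming a PyTorch implementation; the class name CausalTCNBlock, the channel sizes, and the caching strategy in the comments are hypothetical, not the authors' released code (see the linked repository for the actual implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTCNBlock(nn.Module):
    """One causal depthwise-separable conv block (illustrative sketch).

    Causality comes from left-padding the depthwise convolution so each
    output frame depends only on current and past frames, which is what
    makes chunk-by-chunk streaming inference possible.
    """

    def __init__(self, channels=256, hidden=512, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # pad the past only
        self.pointwise_in = nn.Conv1d(channels, hidden, 1)
        self.depthwise = nn.Conv1d(hidden, hidden, kernel_size,
                                   dilation=dilation, groups=hidden)
        self.pointwise_out = nn.Conv1d(hidden, channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):
        # x: (batch, channels, time)
        y = self.act(self.pointwise_in(x))
        y = F.pad(y, (self.left_pad, 0))   # no future context leaks in
        y = self.act(self.depthwise(y))
        return x + self.pointwise_out(y)   # residual connection

# Streaming idea: because every block is causal, feeding the signal in
# short chunks (with each block caching its last `left_pad` frames)
# reproduces the offline output frame for frame.
```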
