Asteroid: the PyTorch-based audio source separation toolkit for researchers

This paper describes Asteroid, a PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all the neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes for common audio source separation datasets are also provided. We describe Asteroid's software architecture and its most important features, and report experimental results obtained with Asteroid's recipes showing that our implementations are at least on par with most results reported in the reference papers. The toolkit is publicly available at this https URL.
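To illustrate the kind of neural building blocks the abstract refers to, here is a minimal sketch of a separation training step built on Asteroid. The class and function names (ConvTasNet, PITLossWrapper, pairwise_neg_sisdr) are assumptions based on the public asteroid package rather than anything specified in this abstract, and may differ across versions.

    # Minimal sketch, assuming the public asteroid package API.
    import torch
    from asteroid.models import ConvTasNet
    from asteroid.losses import PITLossWrapper, pairwise_neg_sisdr

    # A two-speaker Conv-TasNet separation model.
    model = ConvTasNet(n_src=2)

    # Permutation-invariant training (PIT) wrapper around a
    # pairwise negative SI-SDR loss, computed from the pairwise matrix.
    loss_func = PITLossWrapper(pairwise_neg_sisdr, pit_from="pw_mtx")

    mixture = torch.randn(4, 16000)        # batch of 1 s mixtures at 16 kHz
    sources = torch.randn(4, 2, 16000)     # ground-truth sources (batch, n_src, time)

    est_sources = model(mixture)           # estimates, shape (batch, n_src, time)
    loss = loss_func(est_sources, sources) # best-permutation SI-SDR loss
    loss.backward()

In this sketch the model, loss, and PIT wrapper are the kinds of interchangeable components the toolkit exposes; Asteroid's recipes wire such components to the datasets cited in the paper.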
