Audiovisual Singing Voice Separation

Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed improved performance from supervised training with deep learning techniques. We propose to exploit the visual information corresponding to the singers' vocal activities to further improve the quality of the separated vocal signals. A video front-end model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework. To help the network learn the audiovisual correlation of singing activities, we add extra vocal signals, uncorrelated with the mouth movement, to the audio mixture during training. We create two audiovisual singing performance datasets for training and evaluation, respectively: one curated from audition recordings on the Internet, and the other recorded in house. The proposed method outperforms audio-based methods in separation quality on most test recordings. The advantage is especially pronounced when the accompaniment contains backing vocals, which pose a great challenge for audio-only methods.
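The two key ideas above, fusing frame-aligned mouth-movement features into the audio embedding and adding an extra, visually uncorrelated vocal to the training mixture, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the embedding dimensions, the concatenation-based fusion, and the mixing gain are all assumptions, since the abstract does not specify the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes; the abstract does not specify the real dimensions.
T, AUDIO_DIM, VIDEO_DIM = 100, 512, 128

def fuse_embeddings(audio_emb, video_emb):
    """Concatenate per-frame video features onto the audio embedding:
    the simplest form of audiovisual fusion consistent with the abstract."""
    assert audio_emb.shape[0] == video_emb.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_emb, video_emb], axis=1)

def augment_mixture(vocal, accompaniment, extra_vocal, gain=0.5):
    """Mix-audio augmentation: add a vocal signal unrelated to the on-screen
    mouth movement. The separation target is still `vocal` only, so the
    network must rely on the visual stream to pick out the correct voice."""
    mixture = vocal + accompaniment + gain * extra_vocal
    return mixture, vocal  # (network input, training target)

audio_emb = rng.standard_normal((T, AUDIO_DIM))
video_emb = rng.standard_normal((T, VIDEO_DIM))
fused = fuse_embeddings(audio_emb, video_emb)
print(fused.shape)  # (100, 640)
```

In practice the fused features would feed a mask-estimation network, and the augmentation would draw `extra_vocal` from unrelated singing recordings at training time.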
