Guiding audio source separation by video object information

In this work we propose novel joint and sequential multimodal approaches to single-channel audio source separation in videos. Our methods operate within the popular non-negative matrix factorization (NMF) framework and exploit information about the sounding object's motion. Specifically, we present techniques that use a non-negative least squares formulation to couple the motion and audio information. The proposed approaches generalize recent work on NMF-based motion-informed source separation and extend naturally to video data. Experiments on two distinct multimodal datasets of string-instrument performance recordings demonstrate their advantages over existing methods.
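The abstract's core idea can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it runs standard multiplicative-update NMF on a toy magnitude spectrogram, then uses non-negative least squares to couple the audio activations to hypothetical per-source motion features (the matrix `M` and the coupling `H ≈ A M` are illustrative assumptions).

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Toy magnitude spectrogram V (frequency x time), to be factorized V ~ W H.
F, T, K = 32, 40, 4
V = np.abs(rng.standard_normal((F, T)))

# Standard multiplicative updates for NMF under the Euclidean cost
# (Lee & Seung style); eps guards against division by zero.
W = np.abs(rng.standard_normal((F, K))) + 1e-3
H = np.abs(rng.standard_normal((K, T))) + 1e-3
eps = 1e-9
for _ in range(100):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Hypothetical motion features M (sources x time), e.g. average trajectory
# velocity of each sounding object extracted from the video frames.
M = np.abs(rng.standard_normal((K, T)))

# Illustrative motion coupling: fit a non-negative mixing matrix A so that
# H ~ A M via non-negative least squares, giving a motion-regularized
# activation estimate H_motion (one possible coupling, not the paper's
# exact formulation).
A = np.vstack([nnls(M.T, H[k])[0] for k in range(K)])
H_motion = A @ M
```

In a full pipeline, `H_motion` (or a blend of `H` and `H_motion`) would drive Wiener-style masking of the mixture spectrogram to reconstruct per-source audio.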
