Sound Source Tracking and Formation using Normalized Cuts

The goal of computational auditory scene analysis (CASA) is to create computer systems that take as input a mixture of sounds and form packages of acoustic evidence such that each package has most likely arisen from a single sound source. We formulate sound source tracking and formation as a graph partitioning problem and solve it using the normalized cut, a global criterion for segmenting graphs that has been used in computer vision. The criterion measures both the total dissimilarity between the different groups and the total similarity within each group. We describe how this formulation can be combined with sinusoidal modeling, a common technique for sound analysis, manipulation, and synthesis, and provide several examples showing the potential of the approach.
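To make the graph-partitioning idea concrete, the following is a minimal sketch (not the paper's implementation) of a normalized-cut bipartition in the spirit of Shi and Malik: build a similarity matrix over sinusoidal peaks, solve the relaxed problem via the second-smallest eigenvector of the normalized Laplacian, and threshold it to obtain two groups. The Gaussian kernel on frequency distance and the toy peak frequencies are hypothetical choices for illustration only.

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Bipartition a symmetric similarity matrix W using the
    normalized-cut spectral relaxation: solve (D - W) y = lam * D y
    and threshold the second-smallest generalized eigenvector."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt   # symmetric normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    y = D_inv_sqrt @ vecs[:, 1]           # map back: y = D^{-1/2} v
    return y >= 0.0                       # boolean group labels

# Toy example: six spectral peaks, three near 200 Hz and three near 400 Hz,
# standing in for partials tracked by a sinusoidal model.
freqs = np.array([200.0, 205.0, 210.0, 400.0, 405.0, 410.0])
sigma = 100.0  # hypothetical kernel bandwidth in Hz
W = np.exp(-((freqs[:, None] - freqs[None, :]) ** 2) / (2.0 * sigma**2))
np.fill_diagonal(W, 0.0)
labels = normalized_cut_bipartition(W)
```

In a full system, the similarity matrix would encode cues such as frequency and amplitude continuity or harmonicity between sinusoidal partials across frames, and the bipartition would be applied recursively to form more than two sources.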
