Unsupervised analysis of polyphonic music by sparse coding

We investigate a data-driven approach to the analysis and transcription of polyphonic music, using a probabilistic model which is able to find sparse linear decompositions of a sequence of short-term Fourier spectra. The resulting system represents each input spectrum as a weighted sum of a small number of "atomic" spectra chosen from a larger dictionary; this dictionary is, in turn, learned from the data in such a way as to represent the given training set in an (information-theoretically) efficient way. When exposed to examples of polyphonic music, most of the dictionary elements take on the spectral characteristics of individual notes in the music, so that the sparse decomposition can be used to identify the notes in a polyphonic mixture. Our approach differs from other methods of polyphonic analysis based on spectral decomposition by combining all of the following: (a) a formulation in terms of an explicitly given probabilistic model, in which the process of estimating which notes are present corresponds naturally to the inference of latent variables in the model; (b) a particularly simple generative model, motivated by very general considerations about efficient coding, that makes very few assumptions about the musical origins of the signals being processed; and (c) the ability to learn a dictionary of atomic spectra (most of which converge to harmonic spectral profiles associated with specific notes) from polyphonic examples alone, with no separate training on monophonic examples required.
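To make the idea of a sparse linear decomposition concrete, the following is a minimal sketch and not the method of the paper: it assumes the dictionary is already known rather than learned, and it replaces the paper's probabilistic inference with a standard non-negative L1-penalised least-squares fit solved by iterative shrinkage-thresholding (ISTA). The function names, the toy harmonic atoms, and the values of `lam` and `n_iter` are illustrative assumptions.

```python
import numpy as np

def sparse_decompose(D, x, lam=0.05, n_iter=500):
    """Approximately minimise 0.5*||x - D s||^2 + lam*sum(s) subject to s >= 0.

    D : (n_bins, n_atoms) dictionary of "atomic" spectra (assumed given here).
    x : (n_bins,) input magnitude spectrum.
    Uses ISTA, a stand-in for the probabilistic inference described above.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1 / Lipschitz constant of the gradient
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ s - x)                  # gradient of the quadratic term
        s = np.maximum(s - step * (grad + lam), 0.0)  # non-negative soft threshold
    return s

def harmonic_atom(f0_bin, n_bins=64, n_harmonics=5):
    """Toy note spectrum: peaks at multiples of f0_bin with 1/h amplitudes."""
    atom = np.zeros(n_bins)
    for h in range(1, n_harmonics + 1):
        if h * f0_bin < n_bins:
            atom[h * f0_bin] = 1.0 / h
    return atom / np.linalg.norm(atom)

# Hypothetical five-note dictionary and a two-note mixture (atoms 1 and 3).
D = np.stack([harmonic_atom(f) for f in (4, 5, 6, 7, 9)], axis=1)
x = 1.0 * D[:, 1] + 0.6 * D[:, 3]

s = sparse_decompose(D, x)
active = np.flatnonzero(s > 0.1)   # atoms judged "present" in the mixture
print(active, np.round(s, 3))
```

In this toy setting only the two atoms used to build the mixture receive non-zero weights, which is the sense in which a sparse decomposition identifies the notes present. The L1 penalty shrinks the recovered weights slightly below the true mixing coefficients; learning the dictionary itself from polyphonic data, as the paper does, is not shown here.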
