Transcribing Multi-Instrument Polyphonic Music With Hierarchical Eigeninstruments

This paper presents a general probabilistic model for transcribing single-channel music recordings containing multiple polyphonic instrument sources. The system requires no prior knowledge of the instruments present in the mixture other than their number, although it can benefit from information about instrument type if available. In contrast to many existing polyphonic transcription systems, our approach explicitly models the individual instruments and is thereby able to assign detected notes to their respective sources. We use training instruments to learn a set of linear manifolds in model parameter space, which are then used during transcription to constrain the properties of models fit to the target mixture. This leads to a hierarchical mixture-of-subspaces design that makes it possible to supply the system with prior knowledge at different levels of abstraction. The proposed technique is evaluated on both recorded and synthesized mixtures containing two, three, four, and five instruments each. We compare our approach to another multi-instrument transcription system and to a baseline non-negative matrix factorization (NMF) algorithm, evaluating transcription both with source assignment (i.e., detected pitches must be associated with the correct instrument) and without it. For two-instrument mixtures evaluated with source assignment, we obtain average frame-level F-measures of up to 0.52 in the completely blind transcription setting (i.e., no prior knowledge of the instruments in the mixture) and up to 0.67 if we assume knowledge of the basic instrument types. For transcription without source assignment, these numbers rise to 0.76 and 0.83, respectively.
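The difference between the two evaluation modes can be illustrated with a small sketch. It is not the paper's evaluation code. It assumes that reference and estimated transcriptions are binary piano rolls of shape (sources, pitches, frames), that sources are already listed in matching order, and that counts are pooled over all sources. The function names and the toy data are hypothetical.

```python
import numpy as np

def frame_f_measure(reference, estimate):
    """F-measure over the cells of two binary arrays of equal shape."""
    reference = np.asarray(reference, dtype=bool)
    estimate = np.asarray(estimate, dtype=bool)
    tp = np.logical_and(reference, estimate).sum()   # active in both
    fp = np.logical_and(~reference, estimate).sum()  # spurious detections
    fn = np.logical_and(reference, ~estimate).sum()  # missed activity
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

def f_with_source_assignment(ref_rolls, est_rolls):
    # Assumption: a pitch counts as correct only if it is active
    # in the matching source's piano roll.
    return frame_f_measure(ref_rolls, est_rolls)

def f_without_source_assignment(ref_rolls, est_rolls):
    # Assumption: sources are merged with a logical OR, so only
    # the pitch/frame activity matters, not which source produced it.
    merged_ref = np.any(np.asarray(ref_rolls, dtype=bool), axis=0)
    merged_est = np.any(np.asarray(est_rolls, dtype=bool), axis=0)
    return frame_f_measure(merged_ref, merged_est)

# Toy data (hypothetical): 2 sources, 3 pitches, 4 frames.
ref = np.zeros((2, 3, 4), dtype=bool)
ref[0, 0, 0:2] = True   # source 0 plays pitch 0 in frames 0-1
ref[1, 2, 2:4] = True   # source 1 plays pitch 2 in frames 2-3

est = np.zeros((2, 3, 4), dtype=bool)
est[0, 0, 0:2] = True   # correct pitch, correct source
est[0, 2, 2:4] = True   # correct pitch, wrong source

print(f_with_source_assignment(ref, est))     # 0.5
print(f_without_source_assignment(ref, est))  # 1.0
```

In this toy case every pitch is detected, but one note is attributed to the wrong instrument, so the score with source assignment drops to 0.5 while the score without it stays at 1.0. This mirrors the direction of the gap between the reported figures (0.52 vs. 0.76 and 0.67 vs. 0.83).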
