Automatic transcription and separation of the main melody in polyphonic music signals

We address melody extraction jointly with the separation of the lead instrument from the accompaniment in monaural recordings. The first task belongs to Music Information Retrieval (MIR), since it aims at indexing music audio signals by their melody. The second belongs to Blind Audio Source Separation (BASS), since it aims at breaking an audio mixture into several source tracks. Lead instrument separation and main melody extraction are addressed within a unified framework. The lead instrument is modelled with a source/filter production model, in which its signal is generated by two hidden states, the filter state and the source state. The proposed spectral model therefore uses pitch explicitly, both to separate the lead instrument from the other instruments and to transcribe the pitch sequence played by that instrument, the "main melody". This model gives rise to two alternative models, a Gaussian Scaled Mixture Model (GSMM) and an Instantaneous Mixture Model (IMM). The accompaniment is modelled with a more general spectral model. Five systems are proposed: three estimate the fundamental frequency sequence of the lead instrument, i.e. the main melody; a fourth returns a musical melody transcription; and the fifth separates the lead instrument from the accompaniment. The results in melody transcription and source separation are at the state of the art, as shown by our participation in international evaluation campaigns (MIREX'08, MIREX'09 and SiSEC'08). Extending previous source separation work with "MIR" knowledge therefore proves to be a very successful combination.
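To make the source/filter idea concrete, the following toy sketch builds a lead-instrument power spectrum as a pitched harmonic source multiplied by a smooth filter envelope, and then uses it both to pick a pitch and to mask the mixture. It is not the GSMM or IMM described above. The function names, the Gaussian shapes, the parameter values, the normalised-correlation pitch score and the Wiener-like mask are all assumptions chosen for illustration.

```python
import math


def harmonic_source(f0_hz, freqs_hz, width_hz=20.0):
    """Toy source spectrum: a Gaussian bump at the harmonic of f0 nearest to each bin."""
    spectrum = []
    for f in freqs_hz:
        k = max(1, round(f / f0_hz))  # index of the nearest harmonic
        spectrum.append(math.exp(-0.5 * ((f - k * f0_hz) / width_hz) ** 2))
    return spectrum


def smooth_filter(freqs_hz, centre_hz=800.0, bandwidth_hz=1500.0):
    """Toy filter envelope: one broad Gaussian (hypothetical shape and values)."""
    return [math.exp(-0.5 * ((f - centre_hz) / bandwidth_hz) ** 2) for f in freqs_hz]


def lead_power(f0_hz, freqs_hz):
    """Source/filter model: lead spectrum = source spectrum times filter envelope."""
    source = harmonic_source(f0_hz, freqs_hz)
    envelope = smooth_filter(freqs_hz)
    return [s * h for s, h in zip(source, envelope)]


def estimate_f0(mix_power, freqs_hz, candidates_hz):
    """Pick the candidate pitch whose template best matches the mixture.

    The score is a correlation normalised by the template norm (an assumption
    of this sketch), which keeps sub-harmonic candidates from winning simply
    because their templates contain more peaks.
    """
    def score(f0_hz):
        template = lead_power(f0_hz, freqs_hz)
        norm = math.sqrt(sum(t * t for t in template))
        return sum(m * t for m, t in zip(mix_power, template)) / norm

    return max(candidates_hz, key=score)


def wiener_mask(lead, accompaniment, eps=1e-12):
    """Wiener-like soft mask: share of each bin's power attributed to the lead."""
    return [l / (l + a + eps) for l, a in zip(lead, accompaniment)]


if __name__ == "__main__":
    freqs = [5.0 * i for i in range(1, 801)]   # 5 Hz to 4 kHz
    accompaniment = [0.05] * len(freqs)        # flat stand-in for the accompaniment
    mixture = [l + a for l, a in zip(lead_power(220.0, freqs), accompaniment)]

    f0 = estimate_f0(mixture, freqs, [110.0, 165.0, 220.0, 330.0, 440.0])
    mask = wiener_mask(lead_power(f0, freqs), accompaniment)
    print("estimated f0 (Hz):", f0)
    print("mask at the first harmonic:", round(mask[freqs.index(f0)], 3))
```

The same pitched template serves both purposes, which is the point of the unified framework: the pitch that best explains the mixture gives the melody estimate, and the spectrum generated by that pitch gives the mask that isolates the lead instrument.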
