Large-Scale Content-Based Matching of MIDI and Audio Files

MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.

1. TRAINING DATA FOR MIR

Central to the task of content-based Music Information Retrieval (MIR) is the curation of ground-truth data for tasks of interest (e.g., timestamped chord labels for automatic chord estimation, beat positions for beat tracking, prominent melody time series for melody extraction, etc.). The quantity and quality of this ground truth is often instrumental in the success of MIR systems which utilize it as training data. Creating appropriate labels for a recording of a given song by hand typically requires person-hours on the order of the duration of the data, so the availability of training data is a frequent bottleneck in content-based MIR tasks.

MIDI files that are time-aligned to matching audio can provide ground-truth information [8, 25] and can be utilized in score-informed source separation systems [9, 10]. A MIDI file can serve as a timed sequence of note annotations (a "piano roll"). It is much easier to estimate information such as beat locations, chord labels, or predominant melody from these representations than from an audio signal. A number of tools have been developed for inferring this kind of information from MIDI files [6, 7, 17, 19].

Halevy et al. [11] argue that some of the biggest successes in machine learning came about because "...a large training set of the input-output behavior that we seek to automate is available to us in the wild." The motivation behind ...

© Colin Raffel, Daniel P. W. Ellis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Colin Raffel, Daniel P. W. Ellis. "Large-Scale Content-Based Matching of MIDI and Audio Files", 16th International Society for Music Information Retrieval Conference, 2015.
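As a rough illustration of the claim that a MIDI file can serve as a timed sequence of note annotations, the sketch below reads a MIDI file and pulls out a piano roll, beat times, and a pitch-class ("chroma") matrix. It assumes the third-party pretty_midi library and a hypothetical filename, neither of which is named in this excerpt; the MIDI Toolbox [6], jSymbolic [7], and Music21 [14] cited above serve similar purposes.

# A minimal sketch, assuming the third-party pretty_midi library is installed
# and that 'example.mid' is a stand-in for a real MIDI file.
import pretty_midi

pm = pretty_midi.PrettyMIDI('example.mid')   # hypothetical path

# Piano roll: a (128 pitches x T frames) matrix, sampled at fs frames per second.
piano_roll = pm.get_piano_roll(fs=100)

# Beat locations in seconds, derived from the file's tempo and meter events.
beats = pm.get_beats()

# A (12 x T) pitch-class matrix, a common starting point for simple
# chord-labeling heuristics.
chroma = pm.get_chroma(fs=100)

print(piano_roll.shape, chroma.shape, len(beats))

Annotations read off a matched, time-aligned MIDI file in this way can then be transferred to the corresponding audio recording, which is what makes the matching problem worth solving at scale.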
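The abstract describes hashing both modalities into a common Hamming space and then running large-scale dynamic time warping searches over the resulting code sequences. The sketch below shows only that matching step, under the assumption that every frame has already been hashed to a 32-bit binary code; the code length, the plain DTW recurrence, and all names here are illustrative rather than the authors' implementation.

# A generic sketch of DTW matching over binary hash sequences in Hamming space.
# Everything here is an assumption for illustration, not the paper's own code.
import numpy as np


def hamming_distance_matrix(a, b):
    """Pairwise Hamming distances between two sequences of 32-bit codes."""
    xor = np.bitwise_xor(a[:, None], b[None, :])      # (len(a), len(b)) uint32
    bits = np.unpackbits(xor.view(np.uint8), axis=1)   # each code -> 32 bits
    return bits.reshape(len(a), len(b), 32).sum(axis=-1)


def dtw_cost(dist):
    """Plain DTW with unit steps; returns the length-normalized path cost."""
    n, m = dist.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)


# Hypothetical hash sequences for one MIDI file and one audio recording.
rng = np.random.default_rng(0)
midi_codes = rng.integers(0, 2**32, size=200, dtype=np.uint32)
audio_codes = rng.integers(0, 2**32, size=300, dtype=np.uint32)
print(dtw_cost(hamming_distance_matrix(midi_codes, audio_codes)))

In a corpus-scale search, this normalized cost would be computed against every candidate and the lowest-cost entries kept; representing frames as packed binary codes keeps the pairwise distance matrix cheap to build.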

[1] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. 1943.

[2] Eric Jones et al. SciPy: Open Source Scientific Tools for Python. 2001.

[3] George Tzanetakis et al. Polyphonic audio matching and alignment for music retrieval. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003.

[4] Daniel P. W. Ellis et al. Ground-truth transcriptions of real music from force-aligned MIDI syntheses. ISMIR, 2003.

[5] Daniel P. W. Ellis et al. A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures. Computer Music Journal, 2004.

[6] Petri Toiviainen et al. MIR in Matlab: The MIDI Toolbox. ISMIR, 2004.

[7] Ichiro Fujinaga et al. jSymbolic: A Feature Extractor for MIDI Files. ICMC, 2006.

[8] Gert R. G. Lanckriet et al. Towards musical query-by-semantic-description using the CAL500 data set. SIGIR, 2007.

[9] Meinard Müller. Information Retrieval for Music and Motion. 2007.

[10] Praveen Paritosh et al. Freebase: a collaboratively created graph database for structuring human knowledge. SIGMOD Conference, 2008.

[11] Peter Norvig et al. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.

[12] Gautham J. Mysore et al. Source Separation By Score Synthesis. ICMC, 2010.

[13] Youngmoo E. Kim et al. Exploring automatic music annotation with "acoustically-objective" tags. MIR '10, 2010.

[14] Christopher Ariza et al. Music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data. ISMIR, 2010.

[15] Thierry Bertin-Mahieux et al. The Million Song Dataset. ISMIR, 2011.

[16] Jasper Snoek et al. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS, 2012.

[17] Andreas Rauber et al. Facilitating Comprehensive Benchmarking Experiments on the Million Song Dataset. ISMIR, 2012.

[18] Razvan Pascanu et al. Theano: new features and speed improvements. arXiv, 2012.

[19] Meinard Müller et al. Towards Cross-Version Harmonic Analysis of Music. IEEE Transactions on Multimedia, 2012.

[20] Jürgen Schmidhuber et al. Multimodal Similarity-Preserving Hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[21] Mark D. Plumbley et al. Score-Informed Source Separation for Musical Audio Recordings: An Overview. IEEE Signal Processing Magazine, 2014.

[22] Jian Sun et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.

[23] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.