Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

We introduce a dataset to facilitate audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recordings of the assembled mixture, and ground-truth annotation files that include frame-level and note-level transcriptions. We describe our methodology for creating the dataset, highlighting in particular our approaches to the challenges of maintaining synchronization and expressiveness. We demonstrate the high quality of the synchronization achieved with our proposed approach by comparing the dataset with existing widely used music audio datasets. We anticipate that the dataset will be useful for the development and evaluation of existing music information retrieval (MIR) tasks, as well as for novel multimodal tasks. We benchmark two existing MIR tasks (multipitch analysis and score-informed source separation) on the dataset and compare the results with those obtained on other existing music audio datasets. In addition, we consider two novel multimodal MIR tasks enabled by the dataset (visually informed multipitch analysis and polyphonic vibrato analysis) and, drawing on our recent work, provide evaluation measures and baseline systems for future comparisons. Finally, we propose several emerging research directions that the dataset enables.
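As a concrete illustration of how the frame-level ground-truth annotations support the multipitch benchmarking described above, the sketch below computes the standard frame-level precision, recall, and accuracy scores. It is a minimal, self-contained example, not the paper's evaluation code: it assumes the annotation files have already been parsed into one list of pitch values (in semitones, e.g. MIDI numbers) per analysis frame, and the function name and the 0.25-semitone (quarter-tone) tolerance are illustrative choices.

```python
def multipitch_frame_metrics(ref_frames, est_frames, tol=0.25):
    """Frame-level multipitch scores.

    ref_frames, est_frames: lists with one entry per analysis frame,
    each entry a list of pitch values in semitones (e.g. MIDI numbers).
    An estimated pitch counts as a true positive if it lies within
    `tol` semitones of an as-yet-unmatched reference pitch.
    """
    tp = fp = fn = 0
    for ref, est in zip(ref_frames, est_frames):
        remaining = list(ref)  # reference pitches not yet matched
        for p in est:
            # greedy one-to-one matching within the tolerance
            match = next((r for r in remaining if abs(r - p) <= tol), None)
            if match is not None:
                remaining.remove(match)
                tp += 1
            else:
                fp += 1
        fn += len(remaining)  # unmatched reference pitches are misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # "accuracy" in the multipitch-evaluation sense: TP / (TP + FP + FN)
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, accuracy
```

For example, with a two-frame reference `[[60.0, 64.0], [62.0]]` and estimate `[[60.1, 70.0], [62.0]]`, two pitches are matched, one is spurious, and one is missed, giving precision and recall of 2/3 and accuracy of 0.5.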
