Music Information Retrieval: An Inspirational Guide to Transfer from Related Disciplines

The emerging field of Music Information Retrieval (MIR) has been influenced by neighboring domains in signal processing and machine learning, including automatic speech recognition, image processing, and text information retrieval. In this contribution, we start with concrete examples of methodology transfer between speech and music processing, organized around the building blocks of pattern recognition: preprocessing, feature extraction, and classification/decoding. We then take a higher-level viewpoint when describing sources of mutual inspiration derived from text and image information retrieval. We conclude that dealing with the peculiarities of music in MIR research has contributed to advancing the state of the art in other fields, and that many future challenges in MIR are strikingly similar to those that other research areas have been facing.
