Large-Scale Music Annotation and Retrieval: Learning to Rank in Joint Semantic Spaces

Music prediction tasks range from predicting tags given a song or audio clip, to predicting the name of the artist, to predicting related songs given a song, clip, artist name, or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, which poses a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets and attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single low-dimensional semantic space. This space is learnt by jointly optimizing the set of prediction tasks of interest using multi-task learning. Our method outperforms the baseline methods while also being faster and consuming less memory. We then demonstrate that our method learns an interpretable model, in which the semantic space captures the similarities of interest well.
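The idea of a single semantic space shared by audio, artist names, and tags can be illustrated with a toy example. The abstract does not specify the model or the loss, so everything below is an assumption made for illustration: a linear map V from audio features into the shared space, dot-product scoring, a plain pairwise hinge ranking loss trained by stochastic gradient descent, and hypothetical names such as hinge_step, "rock", and "artist_a". It is a minimal sketch of joint training on two tasks (tag prediction and artist prediction), not the authors' implementation.

```python
import random

DIM = 2  # size of the shared embedding space (hypothetical choice)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embed_audio(V, x):
    """Map an audio feature vector x into the shared space: z = V x."""
    return [dot(row, x) for row in V]

def rank(table, z):
    """Return item names sorted by decreasing dot-product score against z."""
    return sorted(table, key=lambda name: dot(table[name], z), reverse=True)

def hinge_step(table, V, x, pos, lr=0.1, margin=1.0):
    """SGD on a pairwise hinge loss: score(pos) should beat score(neg) by margin."""
    for neg in table:
        if neg == pos:
            continue
        z = embed_audio(V, x)
        if margin - dot(table[pos], z) + dot(table[neg], z) <= 0:
            continue  # this pair is already ranked correctly with enough margin
        diff = [p - n for p, n in zip(table[pos], table[neg])]
        # Move the correct label towards the clip, the wrong label away from it.
        table[pos] = [p + lr * zi for p, zi in zip(table[pos], z)]
        table[neg] = [n - lr * zi for n, zi in zip(table[neg], z)]
        # Update the shared audio map using the pre-update label difference.
        for k in range(len(V)):
            V[k] = [v + lr * diff[k] * xj for v, xj in zip(V[k], x)]

rng = random.Random(0)

def small(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

# Separate embedding tables for tags and artists, one shared audio map V.
tags = {"rock": small(DIM), "jazz": small(DIM)}
artists = {"artist_a": small(DIM), "artist_b": small(DIM)}
V = [small(2) for _ in range(DIM)]  # DIM rows, one column per audio feature

# Toy data: (audio features, tag, artist). Real features would be far larger.
clips = [([1.0, 0.0], "rock", "artist_a"), ([0.0, 1.0], "jazz", "artist_b")]

# Multi-task training: both tasks update the same V, so they share one space.
for _ in range(200):
    for x, tag, artist in clips:
        hinge_step(tags, V, x, tag)
        hinge_step(artists, V, x, artist)

top_tags = [rank(tags, embed_audio(V, x))[0] for x, _, _ in clips]
top_artists = [rank(artists, embed_audio(V, x))[0] for x, _, _ in clips]
```

Because tags and artists are scored against the same embedded audio vector, any pair of items (clip to tag, clip to artist, or tag to artist) can be compared with the same dot product, which is what makes the retrieval tasks listed in the abstract expressible in one model.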
