Multimodal Music and Lyrics Fusion Classifier for Artist Identification

Humans interact with each other through multiple communication modalities, including speech, gestures, and written documents. When one modality is absent or noisy, the remaining modalities can improve the precision of a system. HCI systems can likewise benefit from such multimodal communication models across a range of machine learning tasks. The use of multiple modalities is motivated by usability, by the presence of noise in any single modality, and by the fact that no single modality is universally available. However, combining multimodal information introduces new challenges for machine learning, such as the design of fusion classifiers. In this paper we explore the multimodal fusion of audio and lyrics for music artist identification. We compare our results against a single-modality artist classifier and outline new directions for designing fusion classifiers.
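To make the fusion idea concrete, the following is a minimal sketch of feature-level (early) fusion for artist identification: per-track audio descriptors and lyric text features are concatenated into a single vector and fed to one classifier. The input names (`audio_feats`, `lyrics`), the TF-IDF lyric representation, and the linear SVM are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of feature-level (early) fusion for artist identification.
# Assumes `audio_feats` is an (n_tracks, d_audio) array of precomputed audio
# descriptors (e.g., MFCC statistics) and `lyrics` is a list of n_tracks lyric
# strings; these inputs and the classifier choice are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def fuse_and_classify(audio_feats, lyrics, artist_labels):
    # Text modality: bag-of-words lyric features.
    tfidf = TfidfVectorizer(max_features=2000)
    lyric_feats = tfidf.fit_transform(lyrics).toarray()

    # Early fusion: concatenate the two modalities' feature vectors per track.
    fused = np.hstack([audio_feats, lyric_feats])

    # A single classifier trained on the fused representation.
    clf = make_pipeline(StandardScaler(), LinearSVC())
    clf.fit(fused, artist_labels)
    return tfidf, clf
```

An alternative design point is late (decision-level) fusion, where a separate classifier is trained per modality and their scores are combined; the choice between the two is one of the design questions a fusion classifier must address.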
