Automatic Outlier Detection in Music Genre Datasets

Outlier detection, also known as anomaly detection, is an important topic that has been studied for decades. An outlier detection system identifies anomalies in a dataset so that they can be removed, which improves data integrity. Such systems have been applied successfully to different types of data in fields such as cyber-security, finance, and transportation. In the field of Music Information Retrieval (MIR), however, few related studies exist. In this paper, we introduce several state-of-the-art outlier detection techniques and evaluate their viability in the context of music datasets. More specifically, we present a comparative study of six outlier detection algorithms applied to a Music Genre Recognition (MGR) dataset. We determine how well the algorithms identify mislabeled or corrupted files and how much the quality of the dataset can be improved by removing them. The results indicate that state-of-the-art anomaly detection systems struggle to identify anomalies in MGR datasets reliably.
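The evaluation idea described above (inject known anomalies, run a detector, and check how many are recovered) can be illustrated with a minimal sketch. The abstract does not name the six algorithms, so the detector below is a stand-in: a simple k-nearest-neighbour distance score. The toy 2-D features, the function names, and the parameters `k` and `n` are all hypothetical and chosen only for illustration; they are not the features or settings used in the study.

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_indices(scores, n):
    """Return the indices of the n highest-scoring points as a set."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n])

def precision_recall(flagged, true_outliers):
    """Compare flagged indices against the known injected outliers."""
    hits = len(flagged & true_outliers)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_outliers) if true_outliers else 0.0
    return precision, recall

# Hypothetical toy features: a dense 5x5 grid of "clean" points
# plus two injected anomalies far from the grid.
clean = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
injected = [(3.0, 3.0), (-2.0, 4.0)]
features = clean + injected
true_outliers = {len(clean), len(clean) + 1}

scores = knn_outlier_scores(features, k=3)
flagged = top_n_indices(scores, n=2)
precision, recall = precision_recall(flagged, true_outliers)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting the injected points are trivially separable from the clean ones. Real audio features are high-dimensional, and mislabeled files may lie close to legitimate examples of other genres, which is consistent with the difficulty the abstract reports.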
