Clustering of imbalanced high-dimensional media data

Media content in large repositories usually forms multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups, which differ so much from the remaining data that they may be worth examining in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is the task of clustering imbalanced data. Clustering highly imbalanced media groups is made more difficult by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm, designed to reveal group structures in high-dimensional media data. IClust first applies an existing clustering method to find an initial set of many clusters that are potentially highly pure, and then successively merges them. The main advantages of IClust are that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that, in comparison to existing methods, IClust better identifies media groups, especially small ones.
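
The two-stage idea described above (over-cluster first, then merge) can be illustrated with a minimal sketch. This is not the IClust algorithm itself: the abstract does not name the base clustering method or the merge criterion, so the sketch assumes plain k-means for the initial clusters and merges by centroid distance. The function names and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def over_cluster(points, k, iters=20, seed=0):
    """Assumed stage 1: plain k-means, used only to produce many small clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster received no points.
        centers = [centroid(g) if g else centers[j] for j, g in enumerate(groups)]
    return [g for g in groups if g]


def merge_clusters(clusters, threshold):
    """Assumed stage 2: merge the closest pair of centroids until none is within threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Toy imbalanced data: one large group (64 points) and one small group (4 points).
large = [(0.1 * i, 0.1 * j) for i in range(8) for j in range(8)]
small = [(10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1)]
points = large + small

initial = over_cluster(points, k=8)               # deliberately too many clusters
merged = merge_clusters(initial, threshold=5.0)   # final count emerges from merging
print(sorted(len(c) for c in merged))
```

In this sketch, `k` only controls how finely the data is over-segmented; the final number of groups is determined by the merge step. The distance threshold stands in for whatever stopping rule the actual method uses, which the abstract does not state.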
