A novel algorithm for fast and scalable subspace clustering of high-dimensional data

The rapid growth of high-dimensional datasets in recent years has created a pressing need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions, or attributes, of a dataset. Finding clusters in high-dimensional datasets is an important and challenging data mining problem. Data points group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. However, the number of subspaces grows exponentially with the dimensionality of the data, which makes subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many big data application domains such as biology, computer vision, astronomy and social networking. Apriori-based hierarchical clustering is a promising approach that finds all possible higher-dimensional subspace clusters from lower-dimensional clusters using a bottom-up process. However, the performance of existing algorithms based on this approach deteriorates drastically as the number of dimensions increases. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm that finds non-trivial subspace clusters with minimal cost and requires only k database scans for a k-dimensional dataset. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation.
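To make the cost argument concrete, the following minimal sketch counts the non-empty subspaces of a k-dimensional dataset and shows one generic Apriori-style bottom-up join step. It is an illustration of the general approach the abstract refers to, not the SUBSCALE algorithm itself; the function names `count_subspaces` and `apriori_candidates`, and the representation of a subspace as a frozenset of dimension indices, are assumptions made for this example.

```python
from itertools import combinations


def count_subspaces(k: int) -> int:
    """Number of non-empty subsets of k dimensions, i.e. 2^k - 1."""
    return 2 ** k - 1


def apriori_candidates(clustered: set[frozenset[int]]) -> set[frozenset[int]]:
    """Generic Apriori-style join (hypothetical helper, not SUBSCALE).

    `clustered` holds m-dimensional subspaces already known to contain
    clusters; all of them are assumed to have the same size m. Two subspaces
    that share m-1 dimensions are joined into an (m+1)-dimensional candidate,
    and a candidate is kept only if every one of its m-dimensional
    projections is also in `clustered` (downward closure).
    """
    candidates = set()
    for a, b in combinations(clustered, 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue  # the pair does not share exactly m-1 dimensions
        projections = (frozenset(s) for s in combinations(union, len(a)))
        if all(p in clustered for p in projections):
            candidates.add(union)
    return candidates


if __name__ == "__main__":
    for k in (3, 10, 20):
        print(k, count_subspaces(k))
    two_dim = {frozenset(s) for s in [(0, 1), (0, 2), (1, 2), (2, 3)]}
    print(apriori_candidates(two_dim))
```

In this sketch only the subspace {0, 1, 2} survives the join, because the other possible three-dimensional candidates each have a two-dimensional projection that is missing from the input. Each level of such a bottom-up search typically needs its own pass over the data to verify the candidates, which is one reason multiple database scans become costly as dimensionality grows.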
