A survey on enhanced subspace clustering

Subspace clustering finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and has been successfully applied in many domains. In recent years, a new breed of subspace clustering algorithms, which we call enhanced subspace clustering algorithms, has been proposed to (1) handle the increasing abundance and complexity of data and (2) improve the clustering results. In this survey, we present these enhanced approaches to subspace clustering by discussing the problems they solve, their cluster definitions, and their algorithms. Besides enhanced subspace clustering, we also present basic subspace clustering and related work in high-dimensional clustering.
