Discovering Correlated Subspace Clusters in 3D Continuous-Valued Data

Subspace clusters represent useful information in high-dimensional data. However, mining significant subspace clusters in continuous-valued 3D data, such as stock-financial ratio-year data or gene-sample-time data, is difficult. First, typical metrics either find subspaces with very few objects or find too many insignificant subspaces, that is, subspaces that exist by chance. Second, typical 3D subspace clustering approaches abound with parameters, which are usually set under biased assumptions, making the mining process a ‘guessing game’. We address these concerns by proposing an information-theoretic measure, which allows us to identify 3D subspace clusters that stand out from the data. We also develop a highly effective, efficient, and parameter-robust algorithm, a hybrid of information-theoretic and statistical techniques, to mine these clusters. Extensive experiments show that our approach can discover significant 3D subspace clusters embedded in 110 synthetic datasets of varying conditions. A case study on real-world stock datasets shows that our clusters can generate higher profits than those mined by other approaches.
