CARE: Finding Local Linear Correlations in High Dimensional Data

Finding latent patterns in high dimensional data is an important research problem with numerous applications. Existing approaches fall into three categories: feature selection, feature transformation (or feature projection), and projected clustering. Although widely used, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging biomedical applications, however, scientists are interested in local latent patterns held by feature subsets, which may be invisible to any global transformation. In this paper, we investigate the problem of finding local linear correlations in high dimensional data. Our goal is to find latent pattern structures that may exist only in some subspaces. We formalize this problem as finding strongly correlated feature subsets that are supported by a large portion of the data points. Because the problem is combinatorial and the correlation measure is not monotonic, exhaustively exploring the whole search space is prohibitively expensive. Our algorithm, CARE, uses spectral properties and an effective heuristic to prune the search space. Extensive experimental results show that our approach is effective in finding local linear correlations that may not be identified by existing methods.
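To make the notion of a strongly correlated feature subset concrete, the following minimal sketch scores a subset by how small the smallest eigenvalue of its covariance matrix is relative to the total variance. This is an illustrative assumption about what a spectral correlation measure could look like, not the CARE algorithm itself; the function name `smallest_eigen_ratio`, the synthetic data, and the planted relation are all hypothetical.

```python
import numpy as np

def smallest_eigen_ratio(data, features):
    """Share of total variance carried by the smallest eigenvalue of the
    covariance matrix of the selected columns (hypothetical score).
    A value near zero suggests a near-exact linear relation among them."""
    sub = data[:, features]
    cov = np.cov(sub, rowvar=False)      # features are columns
    eigvals = np.linalg.eigvalsh(cov)    # symmetric matrix, ascending order
    return eigvals[0] / eigvals.sum()

rng = np.random.default_rng(0)
n = 500
data = rng.normal(size=(n, 6))

# Plant a linear relation among columns 0, 1, 2 (synthetic assumption):
# x2 = 2*x0 - x1 + small noise
data[:, 2] = 2 * data[:, 0] - data[:, 1] + 0.01 * rng.normal(size=n)

ratio_planted = smallest_eigen_ratio(data, [0, 1, 2])  # correlated subset
ratio_random = smallest_eigen_ratio(data, [3, 4, 5])   # independent features
print(ratio_planted, ratio_random)
```

In this sketch the planted subset yields a ratio close to zero, while the independent subset does not. Scoring every subset this way is exactly the exhaustive search that the abstract describes as prohibitively expensive, which is why pruning is needed.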
