REDUS: finding reducible subspaces in high dimensional data

Finding latent patterns in high dimensional data is an important research problem with numerous applications. The best-known approaches to high dimensional data analysis are feature selection and dimensionality reduction. Although widely used, these methods aim to capture global patterns and are typically applied in the full feature space. In many emerging applications, however, scientists are interested in local latent patterns held by feature subspaces, which may be invisible under any global transformation. In this paper, we investigate the problem of finding strong linear and nonlinear correlations hidden in feature subspaces of high dimensional data. We formalize this problem as identifying reducible subspaces in the full dimensional space. Intuitively, a reducible subspace is a feature subspace whose intrinsic dimensionality is smaller than its number of features. We present an effective algorithm, REDUS, for finding reducible subspaces. Its two key components are finding the overall reducible subspace and uncovering the individual reducible subspaces within it. A broad experimental evaluation demonstrates the effectiveness of our algorithm.
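To make the notion of a reducible subspace concrete, the sketch below flags a set of features as reducible when a simple estimate of its intrinsic dimensionality is smaller than the number of features. This is only an illustration, not the paper's algorithm: the PCA-based estimator, the variance threshold, and the synthetic data are assumptions, and PCA captures only the linear case, whereas REDUS also targets nonlinear correlations.

```python
import numpy as np

def intrinsic_dim_pca(X, var_threshold=0.95):
    """Estimate intrinsic dimensionality as the number of principal
    components needed to explain `var_threshold` of the variance.
    (A stand-in estimator; REDUS's own estimator may differ.)"""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give per-component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, var_threshold) + 1)

def is_reducible(X, var_threshold=0.95):
    """A feature subspace is 'reducible' if its estimated intrinsic
    dimensionality is smaller than its number of features."""
    return intrinsic_dim_pca(X, var_threshold) < X.shape[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 1, size=500)
    noise = 0.01 * rng.standard_normal((500, 3))
    # Three features driven by one latent variable t:
    # intrinsic dimensionality ~1, so this 3-feature subspace is reducible.
    reducible = np.column_stack([t, 2 * t + 1, -0.5 * t]) + noise
    # Three independent features: intrinsic dimensionality ~3, not reducible.
    irreducible = rng.standard_normal((500, 3))
    print(is_reducible(reducible))    # True
    print(is_reducible(irreducible))  # False
```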
