Treelets: A Tool for Dimensionality Reduction and Multi-Scale Analysis of Unstructured Data

In many modern data mining applications, such as the analysis of gene expression or word-document data sets, the data is high-dimensional, with hundreds or even thousands of variables; unstructured, with no specific order of the original variables; and noisy. Despite the high dimensionality, the data is typically redundant, with underlying structures that can be represented by only a few features. In such settings, and specifically when the number of variables is much larger than the sample size, standard global methods may not perform well for common learning tasks such as classification, regression and clustering. In this paper, we present treelets, a new tool for multi-resolution analysis that extends wavelets on smooth signals to general unstructured data sets. By construction, treelets provide an orthogonal basis that reflects the internal structure of the data. In addition, treelets can be useful for feature selection and dimensionality reduction prior to learning. We give a theoretical analysis of our algorithm for a linear mixture model, and present a variety of situations where treelets outperform classical principal component analysis, as well as variable selection schemes such as supervised (sparse) PCA.
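The abstract does not spell out the construction, so the following is only a rough sketch of how a treelet-style orthogonal basis might be built, not the authors' implementation. It assumes that, at each level, the two most correlated active variables are merged by a local two-variable PCA (a Jacobi rotation), and that the higher-variance "sum" coordinate stays active for later merges. The function name `treelet_basis`, its parameters, and the toy linear mixture data are all hypothetical.

```python
import numpy as np

def treelet_basis(X, n_levels=None):
    """Hypothetical sketch: build an orthogonal basis by repeated pairwise rotations.

    Assumptions (not taken from the abstract): similarity is absolute
    correlation, and each merge is a 2x2 PCA (Jacobi rotation).
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    n_levels = p - 1 if n_levels is None else min(n_levels, p - 1)
    cov = np.cov(X, rowvar=False)
    basis = np.eye(p)         # columns are the current basis vectors
    active = list(range(p))   # coordinates still eligible for merging

    for _ in range(n_levels):
        # Find the most correlated pair among the active coordinates.
        sd = np.sqrt(np.diag(cov))
        best, pair = -1.0, None
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                denom = sd[i] * sd[j]
                corr = abs(cov[i, j]) / denom if denom > 0 else 0.0
                if corr > best:
                    best, pair = corr, (i, j)
        i, j = pair

        # Rotation angle that zeroes the covariance between coordinates i and j.
        theta = 0.5 * np.arctan2(2.0 * cov[i, j], cov[i, i] - cov[j, j])
        c, s = np.cos(theta), np.sin(theta)
        R = np.eye(p)
        R[i, i], R[i, j] = c, -s
        R[j, i], R[j, j] = s, c

        cov = R.T @ cov @ R   # covariance in the rotated coordinates
        basis = basis @ R     # accumulate the orthogonal transform

        # Keep the higher-variance ("sum") coordinate active, retire the other.
        active.remove(j if cov[i, i] >= cov[j, j] else i)

    return basis, cov

# Toy data: six noisy variables driven by two latent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
X = factors @ loadings + 0.1 * rng.normal(size=(200, 6))
B, C = treelet_basis(X)
```

Because every step is a plane rotation, the accumulated matrix `B` stays orthogonal, and its columns are supported on groups of correlated variables rather than on all variables at once, which is one way to read the claim that the basis reflects the internal structure of the data.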
