Treelets--An adaptive multi-scale basis for sparse unordered data

In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered, with no particular meaning attached to the given order of the variables. Yet successful learning is often possible due to sparsity: the data are typically redundant, with underlying structures that can be represented by only a few features. In this paper we present treelets, a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive: it returns a hierarchical tree and an orthonormal basis, both of which reflect the internal structure of the data. Treelets are especially well suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and to analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis and better than some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
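The abstract does not spell out the construction, so the following is only a minimal sketch of how a hierarchical tree and an orthonormal basis can be built together. It assumes a bottom-up scheme in which, at each level, the two most correlated active variables are decorrelated by a 2x2 rotation (a local PCA), the higher-variance result is kept as a "sum" variable, and the other is set aside as a "difference" variable. The names `treelets`, `rotate_rows`, and `rotate_cols` are hypothetical, and the 3-variable covariance is a toy stand-in for a blocked covariance model, not data from the paper.

```python
import math


def rotate_rows(M, a, b, c, s):
    """Apply a plane rotation to rows a and b of matrix M (in place)."""
    ra, rb = M[a], M[b]
    M[a] = [c * x + s * y for x, y in zip(ra, rb)]
    M[b] = [-s * x + c * y for x, y in zip(ra, rb)]


def rotate_cols(M, a, b, c, s):
    """Apply the same plane rotation to columns a and b of M (in place)."""
    for row in M:
        x, y = row[a], row[b]
        row[a] = c * x + s * y
        row[b] = -s * x + c * y


def treelets(cov, levels=None):
    """Sketch of a treelet-style decomposition of a covariance matrix.

    Returns (W, merges, C): W holds the basis vectors as rows, merges is the
    list of (sum_index, difference_index) pairs forming the tree, and C is
    the covariance expressed in the new basis.
    """
    p = len(cov)
    C = [list(map(float, row)) for row in cov]
    W = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    active = list(range(p))  # indices still treated as "sum" variables
    merges = []
    levels = p - 1 if levels is None else min(levels, p - 1)

    for _ in range(levels):
        # Find the pair of active variables with the largest |correlation|.
        best, pair = -1.0, None
        for k, i in enumerate(active):
            for j in active[k + 1:]:
                denom = math.sqrt(C[i][i] * C[j][j])
                rho = abs(C[i][j]) / denom if denom > 0 else 0.0
                if rho > best:
                    best, pair = rho, (i, j)
        a, b = pair

        # Rotation angle (|theta| <= pi/4) that zeroes the covariance C[a][b].
        diff = C[a][a] - C[b][b]
        if diff == 0:
            theta = math.copysign(math.pi / 4, C[a][b])
        else:
            theta = 0.5 * math.atan(2 * C[a][b] / diff)
        c, s = math.cos(theta), math.sin(theta)

        # Update the covariance (R C R^T) and the basis (R W).
        rotate_rows(C, a, b, c, s)
        rotate_cols(C, a, b, c, s)
        rotate_rows(W, a, b, c, s)

        # Keep the higher-variance coordinate as the sum variable.
        if C[a][a] < C[b][b]:
            a, b = b, a
        active.remove(b)
        merges.append((a, b))

    return W, merges, C


# Toy example: variables 0 and 1 form a correlated block, variable 2 is independent.
cov = [[1.0, 0.9, 0.0],
       [0.9, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
W, merges, C = treelets(cov)
print(merges)  # order in which variables were merged into the tree
```

In this sketch the rows of `W` stay orthonormal because each step is a plane rotation, and stopping after fewer `levels` gives a coarser or finer multi-scale description. Choosing the level and the similarity measure is left open here.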
