Novel Unsupervised Feature Filtering of Biological Data

MOTIVATION Many methods have been developed for selecting small informative feature subsets in large noisy data. However, unsupervised methods are scarce. Examples are using the variance of data collected for each feature, or the projection of the feature on the first principal component. We propose a novel unsupervised criterion, based on SVD-entropy, selecting a feature according to its contribution to the entropy (CE) calculated on a leave-one-out basis. This can be implemented in four ways: simple ranking according to CE values (SR); forward selection by accumulating features according to which set produces highest entropy (FS1); forward selection by accumulating features through the choice of the best CE out of the remaining ones (FS2); backward elimination (BE) of features with the lowest CE. RESULTS We apply our methods to different benchmarks. In each case we evaluate the success of clustering the data in the selected feature spaces, by measuring Jaccard scores with respect to known classifications. We demonstrate that feature filtering according to CE outperforms the variance method and gene-shaving. There are cases where the analysis, based on a small set of selected features, outperforms the best score reported when all information was used. Our method calls for an optimal size of the relevant feature set. This turns out to be just a few percents of the number of genes in the two Leukemia datasets that we have analyzed. Moreover, the most favored selected genes turn out to have significant GO enrichment in relevant cellular processes.

[1]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[2]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[3]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[4]  Nicolaj Søndberg-Madsen,et al.  Unsupervised Feature Subset Selection , 2003 .

[5]  David Horn,et al.  Novel Clustering Algorithm for Microarray Expression Data in A Truncated SVD Space , 2003, Bioinform..

[6]  Richard M. Karp,et al.  CLIFF: clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts , 2001, ISMB.

[7]  Emily Dimmer,et al.  The Gene Ontology Annotation (GOA) Database - An integrated resource of GO annotations to the UniProt Knowledgebase , 2003, Silico Biol..

[8]  Chris H. Q. Ding,et al.  Adaptive dimension reduction for clustering high dimensional data , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[9]  Tianzi Jiang,et al.  A combinational feature selection and ensemble neural network method for classification of gene expression data , 2004, BMC Bioinformatics.

[10]  Assaf Gottlieb,et al.  Algorithm for data clustering in pattern recognition problems based on quantum mechanics. , 2001, Physical review letters.

[11]  Joaquín Dopazo,et al.  Gene expression data preprocessing , 2003, Bioinform..

[12]  Y. Danieli Guide , 2005 .

[13]  Workshop on Probabilistic Graphical Models for Classification , 2004 .

[14]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[15]  D. Botstein,et al.  Singular value decomposition for genome-wide expression data processing and modeling. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Michal Linial,et al.  COMPACT: A Comparative Package for Clustering Assessment , 2005, ISPA Workshops.

[17]  John B. Moore,et al.  Singular Value Decomposition , 1994 .

[18]  R. Sharan,et al.  CLICK: a clustering algorithm with applications to gene expression analysis. , 2000, Proceedings. International Conference on Intelligent Systems for Molecular Biology.

[19]  Joaquín Dopazo,et al.  An Approach to Inferring Transcriptional Regulation Among Genes From Large-Scale Expression Data , 2003, Comparative and functional genomics.

[20]  Ash A. Alizadeh,et al.  'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns , 2000, Genome Biology.

[21]  Luis Mateus Rocha,et al.  Singular value decomposition and principal component analysis , 2003 .

[22]  Nir Friedman,et al.  Class discovery in gene expression data , 2001, RECOMB.

[23]  Naftali Tishby,et al.  The information bottleneck method , 2000, ArXiv.

[24]  E. Lander,et al.  MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia , 2002, Nature Genetics.

[25]  Roded Sharan,et al.  Center CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis , 2000, ISMB.

[26]  Chris H. Q. Ding,et al.  Unsupervised Feature Selection Via Two-way Ordering in Gene Expression Analysis , 2003, Bioinform..

[27]  Lior Wolf,et al.  Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weighted-based approach , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[28]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[29]  Huiqing Liu,et al.  A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. , 2002, Genome informatics. International Conference on Genome Informatics.