Feature selection using Haar wavelet power spectrum

Background
Feature selection is one way to overcome the "curse of dimensionality" in complex research problems such as disease classification using microarrays. Statistical methods dominate this domain, but most of them do not suit a wide range of datasets. Transform-oriented signal processing techniques have been little explored here, although fields such as image and video processing use them widely. Wavelets, one such technique, have the potential to be used for feature selection. The aim of this paper is to assess the capability of the Haar wavelet power spectrum for clustering and gene selection from expression data in the context of disease classification, and to propose a method based on it.

Results
The Haar wavelet power spectra of genes were analysed and were observed to differ between diagnostic categories. This difference in the trend and magnitude of the spectrum may be exploited for gene selection. Most of the genes selected by earlier, more complex methods were also selected by the present, very simple method. Earlier works have shown that only a few genes are enough to approach the classification problem [1]. Hence the present method may be tried in conjunction with other classification methods. The technique was applied without removing noise from the data, in order to test the robustness of the method against noise and outliers. No special software or complex implementation is needed. The quality of the genes selected by the present method was analysed through their gene expression data. Most of them were observed to be relevant to the classification problem, since they were dominant in the diagnostic category of the dataset for which they were selected as features.

Conclusion
In the present paper, the problem of feature selection for microarray gene expression data was considered. We analysed the wavelet power spectrum of genes and proposed a clustering and feature selection method, based on the Haar wavelet power spectrum, that is useful for classification. The application of this technique in this area is novel and simple, and the method is faster than other methods and fits a wide range of data types. The results are encouraging and shed light on the possibility of using this technique in problem domains such as disease classification, gene network identification and personalized drug design.
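
To make the idea of a per-gene Haar wavelet power spectrum concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a gene's expression values across the samples of one diagnostic category are treated as a signal, that the signal length is a power of two, and that "power" at a level is the sum of squared orthonormal Haar detail coefficients at that level. The function names `haar_power_spectrum` and `spectrum_gap`, and the L1 distance used as a score, are hypothetical choices made only for illustration.

```python
import numpy as np


def haar_power_spectrum(signal):
    """Return the Haar wavelet power per decomposition level, finest level first.

    Assumption: the length of `signal` is a power of two (at least 2).
    """
    approx = np.asarray(signal, dtype=float)
    n = approx.size
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")
    powers = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)  # orthonormal Haar detail coefficients
        approx = (even + odd) / np.sqrt(2.0)  # orthonormal Haar approximation
        powers.append(float(np.sum(detail ** 2)))
    return powers


def spectrum_gap(gene_expr, labels):
    """Hypothetical score: L1 distance between the two per-class power spectra.

    Assumptions: exactly two classes, at least two samples in each; both
    classes are truncated to the same power-of-two number of samples.
    """
    gene_expr = np.asarray(gene_expr, dtype=float)
    labels = np.asarray(labels)
    groups = [gene_expr[labels == c] for c in np.unique(labels)]
    if len(groups) != 2:
        raise ValueError("this sketch handles exactly two classes")
    smallest = min(g.size for g in groups)
    keep = 1 << (smallest.bit_length() - 1)  # largest power of two <= smallest
    spectra = np.array([haar_power_spectrum(g[:keep]) for g in groups])
    return float(np.abs(spectra[0] - spectra[1]).sum())


if __name__ == "__main__":
    # Toy gene: oscillating in class 0, flat in class 1 (made-up values).
    gene = [1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
    classes = [0, 0, 0, 0, 1, 1, 1, 1]
    print(haar_power_spectrum(gene[:4]))
    print(spectrum_gap(gene, classes))
```

Under these assumptions, genes could be ranked by `spectrum_gap` and the top-scoring ones kept as candidate features. The order of samples within a class affects the Haar coefficients, so any real use would need a stated ordering convention.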

[1]  Ingrid Daubechies,et al.  Ten Lectures on Wavelets , 1992 .

[2]  Amara Lynn Graps,et al.  An introduction to wavelets , 1995 .

[3]  Michael Frazier Wavelets on ℤ , 2000 .

[4]  Justin Doak,et al.  CSE-92-18 - An Evaluation of Feature Selection Methods and Their Application to Computer Security , 1992 .

[5]  Ron Kohavi,et al.  Wrappers for feature selection , 1997 .

[6]  David W. Aha,et al.  A Comparative Evaluation of Sequential Feature Selection Algorithms , 1995, AISTATS.

[7]  Xiaobo Zhou,et al.  A Bayesian approach to nonlinear probit gene selection and classification , 2004, J. Frankl. Inst..

[8]  T. Triche,et al.  Experimental evidence for a neural origin of Ewing's sarcoma of bone. , 1987, The American journal of pathology.

[9]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[10]  I. Nikonenko,et al.  Microglia and astrocytes in the adult rat brain: comparative immunocytochemical analysis demonstrates the efficacy of lipocortin 1 immunoreactivity , 2000, Neuroscience.

[11]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[12]  Pietro Liò,et al.  Wavelets in bioinformatics and computational biology: state of art and perspectives , 2003, Bioinform..

[13]  Michael L. Bittner,et al.  cDNA microarrays detect activation of a myogenic transcription program by the PAX3-FKHR fusion oncogene , 1999, Nature Genetics.

[14]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[15]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[16]  Robert Kohn,et al.  Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regression , 2003 .

[17]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[18]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[19]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[20]  Michael L. Bittner,et al.  Strong Feature Sets from Small Samples , 2002, J. Comput. Biol..

[21]  Pat Langley,et al.  Selection of Relevant Features and Examples in Machine Learning , 1997, Artif. Intell..

[22]  Michael I. Jordan,et al.  Simultaneous Relevant Feature Identification and Classification in High-Dimensional Spaces , 2002, WABI.

[23]  Xiaodong Wang,et al.  Binarization of microarray data on the basis of a mixture model. , 2003, Molecular cancer therapeutics.

[24]  E. Dougherty,et al.  NONLINEAR PROBIT GENE CLASSIFICATION USING MUTUAL INFORMATION AND WAVELET-BASED FEATURE SELECTION , 2004 .

[25]  S. T. Buckland,et al.  An Introduction to the Bootstrap. , 1994 .

[26]  Jinyan Li,et al.  Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns , 2002, Bioinform..

[27]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[28]  D. Michie Personal models of rationality , 1990 .

[29]  Nikola Kasabov,et al.  Evolving Connectionist Systems: Methods and Applications in Bioinformatics, Brain Study and Intelligent Machines , 2002, IEEE Transactions on Neural Networks.

[30]  Jun-Ichi Yano,et al.  Time–Frequency Variability of ENSO and Stochastic Simulations , 1998 .

[31]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[32]  Marina Vannucci,et al.  Gene selection: a Bayesian variable selection approach , 2003, Bioinform..

[33]  Andrea Califano,et al.  Analysis of Gene Expression Microarrays for Phenotype Classification , 2000, ISMB.

[34]  T. Sapatinas,et al.  Wavelet Analysis and its Statistical Applications , 2000 .

[35]  A. Aldroubi,et al.  Wavelets in Medicine and Biology , 1997 .

[36]  F. Ramsey,et al.  The statistical sleuth : a course in methods of data analysis , 2002 .

[37]  Philip M. Long,et al.  Optimal gene expression analysis by microarrays. , 2002, Cancer cell.

[38]  Michael I. Jordan,et al.  Feature selection for high-dimensional genomic microarray data , 2001, ICML.

[39]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[40]  Rich Caruana,et al.  Greedy Attribute Selection , 1994, ICML.

[41]  Gilbert Strang,et al.  Wavelets and Dilation Equations: A Brief Introduction , 1989, SIAM Rev..

[42]  Thomas A. Darden,et al.  Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method , 2001, Bioinform..

[43]  R. Kohn,et al.  Nonparametric regression using Bayesian variable selection , 1996 .

[44]  E. Kohn,et al.  Insulin-like growth factor II acts as an autocrine growth and motility factor in human rhabdomyosarcoma tumors. , 1990, Cell growth & differentiation : the molecular biology journal of the American Association for Cancer Research.

[45]  Michael I. Jordan,et al.  Simultaneous classification and relevant feature identification in high-dimensional spaces: application to molecular profiling data , 2003, Signal Process..

[46]  Shenghuo Zhu,et al.  A survey on wavelet applications in data mining , 2002, SKDD.

[47]  S. Mallat A wavelet tour of signal processing , 1998 .

[48]  Justin Doak,et al.  An evaluation of feature selection methods and their application to computer security , 1992 .

[49]  Michael L. Bittner,et al.  cDNA microarrays detect activation of a myogenic transcription program by the PAX3-FKHR fusion oncogene. , 1999 .

[50]  T. Nagano,et al.  Differentially expressed olfactomedin-related glycoproteins (Pancortins) in the brain. , 1998, Brain research. Molecular brain research.