Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

Background: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.

Results: We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets.

Conclusion: For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
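The role of DDP as a balancing parameter can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give the scoring formulas: relevance is taken as a between-class to within-class sum-of-squares ratio per gene, antiredundancy as one minus the absolute Pearson correlation averaged over gene pairs, and the two are combined as relevance^alpha times antiredundancy^(1 - alpha) inside a greedy forward search. The names `select_predictor_set` and `alpha` are hypothetical.

```python
import numpy as np


def select_predictor_set(X, y, size, alpha):
    """Greedy forward selection of `size` genes (columns of X).

    alpha is a hypothetical DDP-style weight in [0, 1]: alpha = 1 scores
    candidates on relevance alone, smaller values put more weight on
    low correlation between the selected genes.
    """
    n_genes = X.shape[1]
    overall_mean = X.mean(axis=0)

    # Assumed relevance measure: between-class / within-class sum of squares.
    between = np.zeros(n_genes)
    within = np.zeros(n_genes)
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    relevance = between / (within + 1e-12)

    # Absolute Pearson correlation between genes; constant genes give NaN,
    # which is mapped to 0 here.
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))

    # Start from the single most relevant gene, then grow the set greedily.
    selected = [int(np.argmax(relevance))]
    while len(selected) < size:
        best_gene, best_score = None, -np.inf
        for g in range(n_genes):
            if g in selected:
                continue
            cand = selected + [g]
            v = relevance[cand].mean()  # set relevance
            pairs = np.triu_indices(len(cand), k=1)
            u = (1.0 - corr[np.ix_(cand, cand)])[pairs].mean()  # antiredundancy
            score = v ** alpha * u ** (1.0 - alpha)
            if score > best_score:
                best_gene, best_score = g, score
        selected.append(best_gene)
    return selected


if __name__ == "__main__":
    # Toy data: 6 samples, 3 classes, 3 genes.
    # Gene 1 is a noisier near-copy of gene 0; gene 2 is weaker but
    # almost uncorrelated with gene 0.
    y = np.array([0, 0, 1, 1, 2, 2])
    X = np.array([
        [0.0, 0.0, 0.0],
        [0.2, 0.3, 0.4],
        [1.0, 1.0, 1.0],
        [1.2, 1.3, 1.4],
        [2.0, 2.0, 0.0],
        [2.2, 2.3, 0.4],
    ])
    print(select_predictor_set(X, y, size=2, alpha=1.0))
    print(select_predictor_set(X, y, size=2, alpha=0.5))
```

On this toy data, a relevance-only setting (alpha = 1) pairs the top gene with its near-duplicate, while alpha = 0.5 pairs it with the weaker but nearly uncorrelated gene. That is the kind of trade-off a prioritization parameter is meant to expose; the values that work well on real microarray data would have to be found empirically.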
