Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods

Microarray data classification is a task involving high dimensionality and small samples sizes. A common criterion to decide on the number of selected genes is maximizing the accuracy, which risks overfitting and usually selects more genes than actually needed. We propose, relaxing the maximum accuracy criterion, to select the combination of attribute selection and classification algorithm that using less attributes has an accuracy not statistically significantly worst that the best. Also we give some advice to choose a suitable combination of attribute selection and classifying algorithms for a good accuracy when using a low number of gene expressions. We used some well known attribute selection methods (FCBF, ReliefF and SVM-RFE, plus a Random selection, used as a base line technique) and classifying techniques (Naive Bayes, 3 Nearest Neighbor and SVM with linear kernel) applied to 30 data sets involving different cancer types.

[1]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[2]  E. Petricoin,et al.  Use of proteomic patterns in serum to identify ovarian cancer , 2002, The Lancet.

[3]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[4]  J. Downing,et al.  Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. , 2002, Cancer cell.

[5]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[6]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[7]  Marko Robnik-Sikonja,et al.  An adaptation of Relief for attribute estimation in regression , 1997, ICML.

[8]  田中 俊典 National Center for Biotechnology Information (NCBI) , 2012 .

[9]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[10]  Carlos J. Alonso,et al.  Selecting Few Genes for Microarray Gene Expression Classification , 2009, CAEPIA.

[11]  Chris H. Q. Ding,et al.  Minimum Redundancy Feature Selection from Microarray Gene Expression Data , 2005, J. Bioinform. Comput. Biol..

[12]  Torben F. Ørntoft,et al.  Identifying distinct classes of bladder carcinoma using microarrays , 2003, Nature Genetics.

[13]  E. Lander,et al.  MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia , 2002, Nature Genetics.

[14]  D. Botstein,et al.  A gene expression database for the molecular pharmacology of cancer , 2000, Nature Genetics.

[15]  David E. Misek,et al.  Gene-expression profiles predict survival of patients with lung adenocarcinoma , 2002, Nature Medicine.

[16]  E. Petricoin,et al.  Use of proteomic patterns in serum to identify ovarian Cancer , 2002 .

[17]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[18]  S. García,et al.  An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons , 2008 .

[19]  T. Golub,et al.  Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. , 2003, Cancer research.

[20]  Tao Li,et al.  A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression , 2004, Bioinform..

[21]  Dana H. Ballard Feature Selection and Classification , 1976 .

[22]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[23]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[24]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[25]  Holger Sültmann,et al.  Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. , 2009, Lung cancer.

[26]  Stuart M. Brown,et al.  Selection and validation of differentially expressed genes in head and neck cancer , 2004, Cellular and Molecular Life Sciences CMLS.

[27]  Zhou-Jun Li,et al.  Are filter methods very effective in gene selection of microarray data? , 2007, 2007 IEEE International Conference on Bioinformatics and Biomedicine Workshops.

[28]  J. Welsh,et al.  Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. , 2001, Cancer research.

[29]  R. Verhaak,et al.  Prognostically useful gene-expression profiles in acute myeloid leukemia. , 2004, The New England journal of medicine.

[30]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[31]  Huan Liu,et al.  Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution , 2003, ICML.

[32]  E. Lander,et al.  Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[33]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[34]  S. Ramaswamy,et al.  Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. , 2002, Cancer research.

[35]  Yi-Ching Hsieh,et al.  In chronic myeloid leukemia white cells from cytogenetic responders and non-responders to imatinib have very similar gene expression signatures. , 2005, Haematologica.

[36]  M. Newton,et al.  Genes Involved in DNA Repair and Nitrosamine Metabolism and Those Located on Chromosome 14q32 Are Dysregulated in Nasopharyngeal Carcinoma , 2006, Cancer Epidemiology Biomarkers & Prevention.

[37]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[38]  Wentian Li,et al.  How Many Genes are Needed for a Discriminant Microarray Data Analysis , 2001, physics/0104029.

[39]  S. Dhanasekaran,et al.  Delineation of prognostic biomarkers in prostate cancer , 2001, Nature.

[40]  Meland,et al.  THE USE OF MOLECULAR PROFILING TO PREDICT SURVIVAL AFTER CHEMOTHERAPY FOR DIFFUSE LARGE-B-CELL LYMPHOMA , 2002 .

[41]  José Hernández-Orallo,et al.  An experimental comparison of performance measures for classification , 2009, Pattern Recognit. Lett..

[42]  J. Bergh,et al.  Identification of molecular apocrine breast tumours by microarray analysis , 2005, Breast Cancer Research.

[43]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[44]  Yoshua Bengio,et al.  Inference for the Generalization Error , 1999, Machine Learning.

[45]  Peter Kokol,et al.  Feature Selection and Classification for Small Gene Sets , 2008, PRIB.

[46]  Liviu Badea,et al.  Combined gene expression analysis of whole-tissue and microdissected pancreatic ductal adenocarcinoma identifies genes specifically overexpressed in tumor epithelia. , 2008, Hepato-gastroenterology.

[47]  Aixia Guo,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2014 .

[48]  Nor Hayati Othman,et al.  A review of feature selection techniques via gene expression profiles , 2008, 2008 International Symposium on Information Technology.

[49]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[50]  M. Tyers,et al.  Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. , 2002, Cancer research.

[51]  M. Xiong,et al.  Biomarker Identification by Feature Wrappers , 2022 .

[52]  C. Domeniconi,et al.  An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification , 2004 .

[53]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[54]  Chris H. Q. Ding,et al.  Minimum redundancy feature selection from microarray gene expression data , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[55]  Terence P. Speed,et al.  A comparison of normalization methods for high density oligonucleotide array data based on variance and bias , 2003, Bioinform..

[56]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[57]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[58]  Yanqing Zhang,et al.  FCM-SVM-RFE Gene Feature Selection Algorithm for Leukemia Classification from Microarray Gene Expression Data , 2005, The 14th IEEE International Conference on Fuzzy Systems, 2005. FUZZ '05..