An expert system to classify microarray gene expression data using gene selection by decision tree

Gene selection can help the analysis of microarray gene expression data. However, it is very difficult to obtain a satisfactory classification result with machine learning techniques because of both the curse-of-dimensionality problem and the over-fitting problem: the number of features (genes) is very large, while the number of samples is small. In this study, we designed an approach that attempts to avoid these two problems and used it to select a small set of significant biomarker genes for diagnosis. We then attempted to use these markers for the classification of cancer. The approach was tested on a number of microarray datasets in order to demonstrate that it performs well and is both useful and reliable.
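The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one way a decision-tree criterion can rank genes. It scores each gene by the best information gain of a single threshold split, the kind of test a C4.5-style tree evaluates at a node. The function names (`entropy`, `best_split_gain`, `select_genes`), the toy expression matrix, and the choice of information gain as the criterion are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split_gain(values, labels):
    """Best information gain over all single-threshold splits of one gene."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal expression values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        best = max(best, gain)
    return best


def select_genes(expression, labels, k):
    """Return the indices of the k genes with the highest split gain.

    expression: list of samples, each a list of per-gene expression values.
    """
    n_genes = len(expression[0])
    gains = []
    for g in range(n_genes):
        column = [sample[g] for sample in expression]
        gains.append((best_split_gain(column, labels), g))
    # Highest gain first; ties broken by gene index for determinism.
    gains.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in gains[:k]]


# Hypothetical toy data: 4 samples x 3 genes. Only gene 0 separates the classes.
expression = [
    [0.1, 5.0, 2.2],
    [0.2, 1.0, 2.1],
    [0.9, 4.8, 2.3],
    [1.1, 1.2, 2.0],
]
labels = ["normal", "normal", "tumor", "tumor"]

selected = select_genes(expression, labels, k=1)
print(selected)  # prints [0]
```

Keeping only a handful of high-gain genes shrinks the feature space before a classifier is trained, which is how a selection step of this kind addresses the many-features, few-samples imbalance described above.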
