Microarray Data Classifier Consisting of k-Top-Scoring Rank-Comparison Decision Rules With a Variable Number of Genes

Microarray experiments generate quantitative expression measurements for thousands of genes simultaneously, which makes them useful for phenotype classification of many diseases. Our proposed phenotype classifier is an ensemble method built from k-top-scoring decision rules. Each rule involves a number of genes, a rank comparison relation among them, and a class label. Existing classifiers of this kind are also ensembles of k-top-scoring decision rules, but some of them fix the number of genes in each rule at two (a pair) or three (a triple). In this paper, we generalize the number of genes involved in each rule, which ranges from 2 to N. Generalizing the number of genes increases the robustness and reliability of the classifier when predicting the class of an independent sample. Our algorithm saves resources by combining shorter rules to build longer rules. It converges rapidly toward its list of high-scoring rules by applying several heuristics. The parameter k is determined by applying leave-one-out cross-validation to the training dataset.
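
To make the rule structure described above concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that a rule "fires" when the expression values of its genes appear in strictly increasing order, that a fired rule casts one vote for its class label, and that the ensemble predicts by majority vote over the first k rules. The names `Rule`, `ordering_holds`, and `predict`, the voting scheme, and the toy data are all hypothetical. Rule scoring, the heuristics for building longer rules, and the cross-validated choice of k are not shown.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence, Tuple


class Rule(NamedTuple):
    """Hypothetical rule: a gene ordering plus the class it votes for."""
    genes: Tuple[int, ...]  # gene indices, expected in increasing expression order
    label: str              # class voted for when the ordering holds


def ordering_holds(sample: Sequence[float], genes: Sequence[int]) -> bool:
    """True if sample[g1] < sample[g2] < ... < sample[gN] (assumed relation)."""
    return all(sample[a] < sample[b] for a, b in zip(genes, genes[1:]))


def predict(rules: Sequence[Rule], sample: Sequence[float], k: int) -> Optional[str]:
    """Majority vote among the first k rules that fire (assumed voting scheme).

    Returns None when none of the first k rules fires.
    """
    votes = Counter(r.label for r in rules[:k] if ordering_holds(sample, r.genes))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy rule list, assumed to be already sorted by score (highest first).
rules = [
    Rule(genes=(0, 1), label="tumor"),      # pair rule
    Rule(genes=(2, 0, 3), label="tumor"),   # triple rule
    Rule(genes=(3, 1), label="normal"),     # pair rule
]

# Toy expression profile for four genes.
sample = [2.0, 5.0, 1.0, 7.0]
prediction = predict(rules, sample, k=3)
```

Because each rule depends only on the relative order of expression values within one sample, the prediction is unchanged by any monotone rescaling of that sample, which is one reason rank-comparison rules are attractive for microarray data.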
