Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data

Classification into multiple classes when the measured variables outnumber the samples is a major methodological challenge in -omics studies. Two algorithms that address this dimensionality problem are presented: the forest classification tree (FCT) and the forest support vector machine (FSVM). In FCT, a set of variables is chosen at random and a classification tree (CT) is grown using a forward classification algorithm. The process is repeated to derive a forest of CTs. Finally, the most frequent variables in the trees with the smallest apparent misclassification rate (AMR) are used to construct a productive tree. In FSVM, the CTs are replaced by SVMs. The methods are demonstrated on prostate gene expression data by classifying tissue samples into four tumor types. With a threshold split value of 0.001 and 100 markers, the productive CT had 29 terminal nodes and achieved perfect classification (AMR=0). With the threshold set to 0.01, a tree with 17 terminal nodes was constructed from 15 markers (AMR=7%). In FSVM, narrowing the fraction of the forest used to construct the best classifier from the top 80% to the top 20% reduced the misclassification rate to 25% (using 200 markers). The proposed methodologies may be used to identify important variables in high-dimensional data. Furthermore, the FCT allows the data structure to be explored and provides a decision rule.
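The FCT procedure described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several details are assumptions made only for illustration: the tree grower is a simple greedy splitter that accepts a split when it lowers the AMR by more than `threshold` (the abstract does not define the threshold split value precisely), each variable is counted once per tree, and the names `grow_tree`, `forest_classification_tree`, `subset_size`, `top_fraction`, and `n_markers` are hypothetical. The toy data at the end are synthetic and unrelated to the prostate data set.

```python
import random
from collections import Counter

def majority_errors(labels):
    """Samples misclassified when a node predicts its majority class."""
    return len(labels) - max(Counter(labels).values()) if labels else 0

def grow_tree(X, y, rows, variables, threshold, n_total):
    """Greedy tree (assumed rule): split while the AMR drops by more than threshold."""
    node_err = majority_errors([y[i] for i in rows])
    best = None
    for v in variables:
        values = sorted({X[i][v] for i in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [i for i in rows if X[i][v] <= cut]
            right = [i for i in rows if X[i][v] > cut]
            err = (majority_errors([y[i] for i in left])
                   + majority_errors([y[i] for i in right]))
            if best is None or err < best[0]:
                best = (err, v, cut, left, right)
    if best is None or (node_err - best[0]) / n_total <= threshold:
        # Terminal node: predict the majority class of the samples that reach it.
        return {"label": Counter(y[i] for i in rows).most_common(1)[0][0]}
    _, v, cut, left, right = best
    return {"var": v, "cut": cut,
            "left": grow_tree(X, y, left, variables, threshold, n_total),
            "right": grow_tree(X, y, right, variables, threshold, n_total)}

def predict(tree, x):
    while "label" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["label"]

def used_variables(tree):
    if "label" in tree:
        return []
    return ([tree["var"]] + used_variables(tree["left"])
            + used_variables(tree["right"]))

def amr(tree, X, y):
    """Apparent misclassification rate: error on the data used to grow the tree."""
    return sum(predict(tree, x) != t for x, t in zip(X, y)) / len(y)

def forest_classification_tree(X, y, n_trees=50, subset_size=3, top_fraction=0.2,
                               n_markers=2, threshold=0.001, seed=0):
    rng = random.Random(seed)
    rows = list(range(len(y)))
    forest = []
    for _ in range(n_trees):
        # Each tree sees only a random subset of the variables.
        variables = rng.sample(range(len(X[0])), subset_size)
        tree = grow_tree(X, y, rows, variables, threshold, len(y))
        forest.append((amr(tree, X, y), tree))
    forest.sort(key=lambda pair: pair[0])
    best_trees = forest[:max(1, int(top_fraction * n_trees))]
    # Count how often each variable appears among the lowest-AMR trees.
    counts = Counter(v for _, t in best_trees for v in set(used_variables(t)))
    markers = [v for v, _ in counts.most_common(n_markers)]
    final_tree = grow_tree(X, y, rows, markers, threshold, len(y))
    return markers, final_tree, amr(final_tree, X, y)

# Synthetic toy data: variable 0 encodes the class, variables 1-5 are uninformative.
y = ["A", "A", "B", "B", "C", "C", "D", "D"]
X = [[i // 2] + [i % 2] * 5 for i in range(8)]
markers, final_tree, final_amr = forest_classification_tree(X, y)
print(markers, final_amr)
```

Replacing `grow_tree` with an SVM fit on the sampled variables would give the corresponding FSVM sketch; note that the AMR is computed on the training samples, so it is an optimistic estimate of the error on new data.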
