Investigation of the attribute of the random forest in defining significant pathways

High-throughput genomic data produced by DNA microarrays provide an opportunity to identify genes that are related to various clinical phenotypes. Besides these genomic data, another significant source of information is biological knowledge about the genes and pathways that are related to the phenotypes of complex diseases such as cancer. In recent work, microarray analysis has mostly been performed one gene at a time, as in gene selection and classification. Such approaches frequently neglect gene-gene and gene-pathway interactions, which leads to a loss of biological interpretability. To overcome this limitation of univariate microarray analysis, microarray data can be integrated with biological knowledge such as pathway information, which may yield more informative and relevant knowledge for biological research. This research is therefore mainly concerned with pathway-based analysis of microarray gene expression data, which aims to define significant phenotype-related pathways and genes by ranking pathways. The purpose of this paper is to investigate the properties of Random Forest and its applicability to ranking pathways via classification and regression methods.
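
As an illustration of what ranking pathways with Random Forest can look like in practice, the following is a minimal sketch rather than the method evaluated in this paper. It assumes that each pathway is represented by the column indices of its genes in an expression matrix, that one forest is trained per pathway, and that pathways are ranked by out-of-bag (OOB) classification accuracy. The function name `rank_pathways`, the pathway names, and the synthetic data are all hypothetical; the forest itself comes from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_pathways(expression, labels, pathways, n_trees=200, seed=0):
    """Rank pathways by the OOB accuracy of a forest trained on their genes only.

    expression: array of shape (n_samples, n_genes)
    labels:     array of shape (n_samples,) with class labels (phenotypes)
    pathways:   dict mapping a pathway name to a list of gene column indices
    """
    scores = {}
    for name, gene_idx in pathways.items():
        forest = RandomForestClassifier(
            n_estimators=n_trees,
            oob_score=True,  # estimate accuracy on samples left out of each bootstrap
            random_state=seed,
        )
        forest.fit(expression[:, gene_idx], labels)
        scores[name] = forest.oob_score_
    # Highest OOB accuracy first: these pathways separate the phenotypes best.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic data (an assumption for illustration): 60 samples, 30 genes,
# two phenotype classes, with a mean shift in genes 0-4 for class 1.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 30
labels = np.repeat([0, 1], n_samples // 2)
expression = rng.normal(size=(n_samples, n_genes))
expression[labels == 1, :5] += 2.0

# Hypothetical pathways: one containing the shifted genes, two of pure noise.
pathways = {
    "pathway_signal": [0, 1, 2, 3, 4],
    "pathway_noise_a": list(range(10, 20)),
    "pathway_noise_b": list(range(20, 30)),
}

ranking = rank_pathways(expression, labels, pathways)
for name, oob_accuracy in ranking:
    print(f"{name}: OOB accuracy = {oob_accuracy:.2f}")
```

For a continuous phenotype, the same loop could use `RandomForestRegressor` and rank pathways by its OOB score instead. OOB accuracy is only one possible ranking criterion and is used here as an assumption.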
