Random Forest with Random Projection to Impute Missing Gene Expression Data

Measurement error or the lack of a proper experimental setup often results in invalid or missing data in gene expression studies. Small sample sizes and the cost of re-running an experiment create a need for an efficient missing data imputation technique. In this paper, we propose a method based on Random Forest that uses Random Projection as a data pre-processing filter. Initial results, obtained with varying proportions of missing data on a variety of real datasets, show that Random Forest based imputation performs as well as or better than methods based on K-Nearest Neighbor and Support Vector Regression. We also show that Random Projection can reduce the dimensionality of a dataset by 50 percent without affecting the imputation process.
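The dimensionality reduction step can be illustrated with a small sketch. The code below is not the implementation used in the paper; it assumes a sign-based random projection matrix (entries of +1 or -1, scaled by 1/sqrt(k)), synthetic Gaussian data in place of real expression values, and features stored as columns. The function names `random_projection_matrix`, `project`, and `euclidean` are hypothetical. The sketch halves the number of dimensions and checks how well pairwise distances survive; the Random Forest imputation step itself is omitted.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """Build a d x k matrix whose entries are +1/sqrt(k) or -1/sqrt(k) with equal probability."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.choice((-scale, scale)) for _ in range(k)] for _ in range(d)]


def project(rows, matrix):
    """Multiply each row vector (length d) by the d x k projection matrix."""
    k = len(matrix[0])
    return [
        [sum(x * matrix[i][j] for i, x in enumerate(row)) for j in range(k)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_samples, d = 20, 400          # assumed sizes, chosen only for illustration
k = d // 2                      # reduce dimensionality by 50 percent

# Synthetic stand-in for an expression matrix (samples x features).
data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_samples)]

R = random_projection_matrix(d, k, rng)
reduced = project(data, R)

# Ratio of projected to original distance for every pair of samples.
ratios = [
    euclidean(reduced[i], reduced[j]) / euclidean(data[i], data[j])
    for i in range(n_samples)
    for j in range(i + 1, n_samples)
]
mean_ratio = sum(ratios) / len(ratios)
print(f"reduced to {len(reduced[0])} dimensions, mean distance ratio {mean_ratio:.3f}")
```

A mean ratio close to 1 indicates that the projected data roughly preserves the pairwise geometry of the original, which is the property that allows a downstream imputation method to work in the reduced space.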
