Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

Motivation: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forest (RF) classifiers, which are ensembles of decision trees, are amongst the best-performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarity measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive.

Results: We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in human brain structure. PaRFR provides a ranking of SNPs associated with this trait, and produces pairwise measures of genetic proximity that can be directly compared to pairwise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity.

Availability: The Java code is freely available at http://www2.imperial.ac.uk/~gmontana.
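To illustrate why a distance-based splitting criterion can save time with high-dimensional traits, the following is a minimal Python sketch, not the PaRFR Java implementation. It relies on the standard identity that the sum of squared deviations from a node's centroid equals the sum of all pairwise squared Euclidean distances within the node divided by twice the node size, so the phenotype distance matrix can be computed once and reused at every node. The function names (`pairwise_sq_dist`, `within_node_dispersion`, `best_split_for_snp`), the 0/1/2 genotype coding, and the choice of Euclidean distance are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def pairwise_sq_dist(Y):
    """Squared Euclidean distances between all rows of Y (n samples x q traits)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(axis=-1)


def within_node_dispersion(sq_dist, idx):
    """Sum of squared deviations from the node centroid, using distances only.

    Uses sum_i ||y_i - mean||^2 = (1 / (2n)) * sum_{i,j} ||y_i - y_j||^2,
    so the q-dimensional trait vectors are never touched here.
    """
    n = len(idx)
    if n == 0:
        return 0.0
    return sq_dist[np.ix_(idx, idx)].sum() / (2.0 * n)


def best_split_for_snp(genotypes, sq_dist, idx):
    """Best threshold split on one SNP (assumed coded 0/1/2) for the samples in idx.

    Returns (gain, threshold); threshold is None if no split reduces dispersion.
    """
    parent = within_node_dispersion(sq_dist, idx)
    best_gain, best_threshold = 0.0, None
    for threshold in (0, 1):
        mask = genotypes[idx] <= threshold
        left, right = idx[mask], idx[~mask]
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, nothing to compare
        gain = (parent
                - within_node_dispersion(sq_dist, left)
                - within_node_dispersion(sq_dist, right))
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_gain, best_threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(12, 30))          # toy phenotype: 12 subjects, 30 traits
    snp = rng.integers(0, 3, size=12)      # toy genotypes for one SNP
    D = pairwise_sq_dist(Y)                # computed once, reused at every node
    print(best_split_for_snp(snp, D, np.arange(12)))
```

In this sketch the cost of evaluating a candidate split depends on the number of subjects in the node rather than on the dimension of the trait. Because each tree in a forest is grown independently, such per-tree work is also a natural unit to distribute across map tasks, with a reduce step aggregating variable importance scores; how PaRFR actually partitions the work is not described in the abstract.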
