Parallel classification and feature selection in microarray data using SPRINT

The statistical language R is favoured by many biostatisticians for processing microarray data. The quantity of data that experiments can produce has recently risen significantly, making previously fast analyses time-consuming or even impossible with the existing software infrastructure. High performance computing (HPC) systems offer a solution to these problems, but at the expense of increased complexity for the end user. The Simple Parallel R Interface (SPRINT) is a library for R that aims to reduce the complexity of using HPC systems by providing biostatisticians with drop-in parallelised replacements for existing R functions. In this paper we describe parallel implementations of two popular techniques: exploratory clustering analyses using the random forest classifier, and feature selection through identification of differentially expressed genes using the rank product method. Copyright © 2012 John Wiley & Sons, Ltd.
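As an illustration of the rank product statistic named above, here is a minimal serial sketch in Python. It is not SPRINT's implementation or its R interface. The function name `rank_product` and the input layout (one list of per-gene log fold changes per replicate) are assumptions made for this example. Ties are broken by input order, and the permutation-based significance estimate that normally accompanies the method is omitted.

```python
import math


def rank_product(fold_changes):
    """Return the rank product of each gene across replicates.

    fold_changes: list of replicates; each replicate is a list holding one
    log fold change per gene (hypothetical layout chosen for this sketch).
    Rank 1 is assigned to the largest fold change in a replicate.
    """
    n_replicates = len(fold_changes)
    n_genes = len(fold_changes[0])
    log_rank_sum = [0.0] * n_genes

    for replicate in fold_changes:
        # Gene indices ordered from most to least up-regulated.
        order = sorted(range(n_genes), key=lambda g: replicate[g], reverse=True)
        for rank, gene in enumerate(order, start=1):
            # Summing logs avoids overflow when multiplying many ranks.
            log_rank_sum[gene] += math.log(rank)

    # Geometric mean of the ranks: exp(mean of the log ranks).
    return [math.exp(s / n_replicates) for s in log_rank_sum]
```

Genes with the smallest rank product are the ones ranked consistently near the top across replicates, which makes them candidates for differential expression. Each replicate is ranked independently of the others.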
