SPRINT: A new parallel framework for R

Background

Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems. Many biostatisticians use R to process the data gleaned from microarray analysis, and there is even a dedicated group of packages, Bioconductor, for this purpose. However, to exploit HPC systems, R must be able to utilise the multiple processors available on these systems. There are existing modules that enable R to use multiple processors, but these are either difficult to use for the HPC novice or cannot be used to solve certain classes of problems. A method of exploiting HPC systems from R, without requiring mastery of parallel programming paradigms, is therefore necessary to analyse genomic data to its fullest.

Results

We have designed and built a prototype framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing. As an example, we created a function that computes a pairwise correlation matrix. This performs well with SPRINT. When executed using SPRINT on an HPC resource of eight processors, this computation runs more than three times faster than R does on one processor.

Conclusion

SPRINT allows the biostatistician to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and, with further development, will become more useful as more functions are added to the framework.
