Combining R with Scalable Libraries to Get the Best of Both for Big Data

The R programming language is known for the diversity and sophistication of its data analysis tools, but its scalability to big data has been lacking. Our project, "Programming with Big Data in R" (pbdR), adds scalability to R's list of data virtues. We have developed several packages that tightly couple R with highly scalable libraries, enabling scalability to terabytes of data on tens of thousands of cores. We have also added classes and methods that handle the distributed data objects these libraries require, so that R language syntax is largely unchanged. Our philosophy is that the R developer should not have to deal with the details of managing data distribution and processor communication, but should be aware of the data distribution and have high-level functions available to manage it if needed. Many R functions are already instrumented to handle the distributed data classes. We encourage developers of compute-intensive R packages to use pbdR methods for scalability to bigger data and bigger computing platforms.
