RProtoBuf: Efficient Cross-Language Data Serialization in R

Modern data collection and analysis pipelines often involve a sophisticated mix of applications written in general-purpose and specialized programming languages. Many formats commonly used to import and export data between different programs or systems, such as CSV or JSON, are verbose, inefficient, not type-safe, or tied to a specific programming language. Protocol Buffers are a popular method of serializing structured data between applications while remaining independent of programming languages and operating systems. They offer a combination of features, performance, and maturity that seems particularly well suited to data-driven applications and numerical computing. The RProtoBuf package provides a complete interface to Protocol Buffers from the R environment for statistical computing. This paper outlines the general class of data serialization requirements for statistical computing, describes the implementation of the RProtoBuf package, and illustrates its use with example applications in large-scale data collection pipelines and web services.
