Bridging two worlds with RICE

The growing need to base sophisticated business analysis on large amounts of data conflicts with the current capabilities of statistical software systems and with the functions provided by most modern databases. We developed two novel approaches to resolving this conflict, based on the widely used statistical software package R and the SAP In-Memory Computing Engine (IMCE). First, we propose an alternative data exchange mechanism with R: instead of using standard SQL interfaces such as JDBC or ODBC, we introduce SQL-SHM, a shared-memory-based data exchange that incorporates R's vertical data structure. Second, we extend this approach to R-Op, which introduces R scripts as operators equivalent to native database operations such as join or aggregation within the execution plans. With its calculation engine, IMCE provides a framework for modeling logical execution plans and thereby offers a convenient way to use the full functionality of R via an SQL interface. Moreover, this enables us to run R scripts in parallel without having to extend the R interpreter itself.
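The idea of exchanging a column through shared memory, rather than shipping rows through a JDBC or ODBC style interface, can be illustrated with a small sketch. This is not the SQL-SHM implementation described in the abstract; it is a minimal Python analogy using the standard `multiprocessing.shared_memory` module, and the function names `write_column`, `read_column`, and `column_mean` are hypothetical helpers introduced only for illustration. It assumes a single column of double-precision values.

```python
import struct
from multiprocessing import shared_memory


def write_column(values):
    """Producer side: copy one column of doubles into a new shared memory segment."""
    # A segment must have a non-zero size, even for an empty column.
    shm = shared_memory.SharedMemory(create=True, size=max(1, 8 * len(values)))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm


def read_column(name, count):
    """Consumer side: attach to the segment by name and read `count` doubles."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # The segment may be rounded up to a page size, so read exactly `count` values.
        return list(struct.unpack_from(f"{count}d", shm.buf, 0))
    finally:
        shm.close()


def column_mean(name, count):
    """Stand-in for a statistical routine that consumes a whole column at once."""
    column = read_column(name, count)
    return sum(column) / len(column)


if __name__ == "__main__":
    segment = write_column([1.5, 2.5, 4.0])
    try:
        print(column_mean(segment.name, 3))
    finally:
        segment.close()
        segment.unlink()  # the producer owns the segment and removes it
```

Only the segment name and the value count cross the boundary between producer and consumer; the column itself stays in one contiguous block, which is the layout a vector-oriented consumer such as R works with most naturally.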
