Scalable Machine Learning in the R Language Using a Summarization Matrix

Big data analytics generally relies on parallel processing in large computer clusters. However, this approach is not always the best. CPU speeds and RAM capacities keep growing, making small computers faster and more attractive to the analyst. Machine Learning (ML) models are generally computed on a data set obtained by aggregating, transforming, and filtering big data, and this data set is orders of magnitude smaller than the raw data. Users prefer "easy" high-level languages like R and Python, which accomplish complex analytic tasks with a few lines of code, but these languages have memory and speed limitations. Finally, data summarization has been a fundamental technique in data mining that holds great promise with big data. With that motivation in mind, we adapt the \(\varGamma \) (Gamma) summarization matrix, previously used in parallel DBMSs, to work in the R language. \(\varGamma \) is significantly smaller than the data set, yet captures its fundamental statistical properties. \(\varGamma \) works well for a remarkably wide spectrum of ML models, both supervised and unsupervised, whether dimensions (variables) are assumed dependent or independent. An extensive experimental evaluation shows that models computed on summarized data sets are accurate and that their computation is significantly faster than R built-in functions. Moreover, experiments illustrate that our R solution is faster and less resource-hungry than competing parallel systems, including a parallel DBMS and Spark.
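As a rough illustration of the idea (a minimal sketch, not the paper's implementation), the R code below builds a Gamma-style summary \(\varGamma = Z^{T} Z\), where \(Z\) augments the data matrix with a constant column and the response, and then recovers least-squares coefficients from the summary alone. The variable names (X, y, Z, d) and the synthetic data are illustrative assumptions.

```r
# Sketch: one-pass Gamma-style summarization in base R (assumed setup).
set.seed(1)
n <- 1000; d <- 3
X <- matrix(rnorm(n * d), nrow = n)          # n x d explanatory variables
y <- X %*% c(2, -1, 0.5) + rnorm(n)          # synthetic response

Z <- cbind(1, X, y)                          # augmented data matrix: [1, X, y]
Gamma_mat <- t(Z) %*% Z                      # (d+2) x (d+2) summary, one pass over Z

# Blocks of Gamma hold n, column sums, X'X, X'y and y'y;
# linear regression needs only the summary, not the original data.
XtX <- Gamma_mat[1:(d + 1), 1:(d + 1)]       # includes the intercept column
Xty <- Gamma_mat[1:(d + 1), d + 2]
beta_hat <- solve(XtX, Xty)                  # least-squares coefficients from Gamma

# Sanity check against R's built-in fit (should agree up to rounding).
coef(lm(y ~ X))
```

The same summary can feed other models discussed in the paper (e.g., PCA via the covariance or correlation matrix derived from \(\varGamma \)), since the expensive pass over the data is performed only once.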
