There is a need to integrate SQL processing with more advanced machine learning (ML) analytics to derive actionable insights from large volumes of data. As a first step towards this integration, we study how to efficiently connect big SQL systems (either MPP databases or new-generation SQL-on-Hadoop systems) with distributed big ML systems. We identify two important challenges in the integrated data analytics pipeline: data transformation, that is, how to efficiently transform SQL data into a form suitable for ML, and data transfer, that is, how to efficiently hand over SQL data to ML systems. For the data transformation problem, we propose an In-SQL approach that incorporates common data transformations for ML inside SQL systems through extended user-defined functions (UDFs), exploiting the massive parallelism of big SQL systems. For the data transfer problem, we propose and study a general method for transferring data between big SQL and big ML systems in a parallel streaming fashion. Furthermore, we explore caching the intermediate or final results of data transformation to improve performance. Our techniques are generic: they apply to any big SQL system that supports UDFs and any big ML system that uses Hadoop InputFormats to ingest input data.
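To make the In-SQL transformation idea concrete, the following is a minimal sketch of recoding a categorical column into integer codes through a UDF. It is an illustration only, not the implementation studied in this work: it uses Python's built-in `sqlite3` as a single-node stand-in for a big SQL system, so it shows no parallelism, and the `customers` table, its columns, and the `recode_gender` function name are all hypothetical.

```python
import sqlite3


def build_recode_map(conn, table, column):
    """Phase 1: collect the distinct values and assign consecutive integer codes."""
    # Table and column names are hypothetical and trusted here (not user input).
    query = f"SELECT DISTINCT {column} FROM {table} ORDER BY {column}"
    rows = conn.execute(query).fetchall()
    return {value: code for code, (value,) in enumerate(rows, start=1)}


def transform_for_ml(conn):
    """Phase 2: apply the recoding inside the SQL engine through a UDF."""
    recode_map = build_recode_map(conn, "customers", "gender")
    # Register a one-argument scalar UDF; unseen values map to NULL.
    conn.create_function("recode_gender", 1, lambda value: recode_map.get(value))
    query = "SELECT age, recode_gender(gender) FROM customers ORDER BY id"
    return conn.execute(query).fetchall()


# Hypothetical sample data standing in for a large SQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER, gender TEXT)"
)
conn.executemany(
    "INSERT INTO customers (age, gender) VALUES (?, ?)",
    [(34, "F"), (51, "M"), (27, "F")],
)

rows = transform_for_ml(conn)
print(rows)
```

In this sketch the numeric rows leave the SQL engine already in a form an ML system could ingest. In a parallel SQL system, the same per-row UDF would presumably run on each partition independently once the code mapping has been built.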