Declarative Machine Learning - A Classification of Basic Properties and Types

Declarative machine learning (ML) aims at the high-level specification of ML tasks or algorithms, and the automatic generation of optimized execution plans from these specifications. The fundamental goal is to simplify the usage and/or development of ML algorithms, which is especially important in the context of large-scale computations. However, ML systems at different abstraction levels have emerged over time, and accordingly there has been controversy about the meaning of this general definition of declarative ML. Specification alternatives range from ML algorithms expressed in domain-specific languages (DSLs) with optimization for performance, to ML task (learning problem) specifications with optimization for performance and accuracy. We argue that these different types of declarative ML complement each other, as they address different users (data scientists and end users). This paper attempts to create a taxonomy for declarative ML, including a definition of its essential basic properties and types. Along the way, we provide insights into the implications of these properties. We also use this taxonomy to classify existing systems. Finally, we draw conclusions on defining appropriate benchmarks and specification languages for declarative ML.
