Vertica-ML: Distributed Machine Learning in Vertica Database

A growing number of companies rely on machine learning as a key element for gaining a competitive edge from the Big Data they collect. An in-database machine learning system offers several advantages in this scenario: it eliminates the overhead of data transfer, avoids the maintenance costs of a separate analytical system, and addresses data security and provenance concerns. In this paper, we present our distributed machine learning subsystem within the Vertica database. This subsystem, Vertica-ML, exposes machine learning functionality through a SQL API that covers a complete data science workflow as well as model management. We treat machine learning models in Vertica as first-class database objects, like tables and views; they therefore benefit from similar mechanisms for archiving and management. We explain the architecture of the subsystem and present a set of experiments that evaluate the performance of the machine learning algorithms implemented on top of it.
