A Comparison of Decision Forest Inference Platforms from A Database Perspective

Decision forests, including RandomForest, XGBoost, and LightGBM, are among the most popular machine learning techniques, used in many industrial scenarios such as credit card fraud detection, ranking, and business intelligence. Because the inference process is usually performance-critical, a number of frameworks have been developed and dedicated to decision forest inference, such as ONNX, Treelite from Amazon, TensorFlow Decision Forests from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled from data management systems, and it is unclear whether in-database inference would improve overall performance. In addition, these frameworks use different algorithms, optimization techniques, and parallelism models, and it is unclear how these implementation choices affect overall performance, or how to make design decisions for an in-database inference framework. In this work, we investigate these questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks with netsDB, an in-database inference framework we implemented. Through this study, we found that netsDB is best suited for handling small-scale models on large-scale datasets and models of all scales on small-scale datasets, for which it achieved up to hundreds of times speedup. In addition, the relation-centric representation we propose significantly improves netsDB's performance in handling large-scale models, while the model-reuse optimization we propose further improves its performance on small-scale datasets.
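For readers unfamiliar with the workload being benchmarked: decision forest inference is, at its core, a batch of root-to-leaf tree traversals whose per-tree outputs are aggregated, and it is exactly this traversal loop that the frameworks above compile or vectorize. The following is a minimal pure-Python sketch with a hypothetical structure-of-arrays node layout; it illustrates the general technique and is not the node format of any specific framework:

```python
def predict_tree(tree, x):
    """Walk one tree from the root to a leaf; return the leaf value.
    A negative feature index marks a leaf node (hypothetical convention)."""
    node = 0
    while tree["feature"][node] >= 0:  # internal node: compare and descend
        if x[tree["feature"][node]] < tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["value"][node]  # leaf reached

def predict_forest(forest, x):
    """Average the per-tree outputs (RandomForest-style aggregation;
    gradient-boosted forests such as XGBoost would sum instead)."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)

# A toy two-tree forest: each tree is one split node with two leaves.
tree_a = {"feature": [0, -1, -1], "threshold": [0.5, 0.0, 0.0],
          "left": [1, 0, 0], "right": [2, 0, 0], "value": [0.0, 1.0, 3.0]}
tree_b = {"feature": [1, -1, -1], "threshold": [2.0, 0.0, 0.0],
          "left": [1, 0, 0], "right": [2, 0, 0], "value": [0.0, 2.0, 4.0]}
forest = [tree_a, tree_b]

print(predict_forest(forest, [0.2, 5.0]))  # (1.0 + 4.0) / 2 = 2.5
```

In-database inference, as studied here, amounts to executing this kind of traversal inside the query engine over relational data, rather than exporting the data to an external serving framework first.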
