Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library

Yggdrasil Decision Forests is a library for the training, serving, and interpretation of decision forest models, targeted at both research and production work. It is implemented in C++ and available in C++, through a command-line interface, in Python (under the name TensorFlow Decision Forests), in JavaScript, in Go, and in Google Sheets (under the name Simple ML for Sheets). The library has been developed organically since 2018 following four design principles applicable to machine learning libraries and frameworks: simplicity of use, safety of use, modularity and high-level abstraction, and integration with other machine learning libraries. In this paper, we describe these principles in detail and show how they have guided the design of the library. We then demonstrate the library on a set of classical machine learning problems. Finally, we report a benchmark comparing our library to related solutions.
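As a concrete illustration of the Python port mentioned above, the following is a minimal sketch of training and inspecting a model with TensorFlow Decision Forests; it is not taken from the paper, and the file name train.csv and label column income are placeholders.

```python
import pandas as pd
import tensorflow_decision_forests as tfdf

# Load a tabular dataset into a pandas DataFrame.
# "train.csv" and the label column "income" are placeholders.
train_df = pd.read_csv("train.csv")

# Convert the DataFrame into a TensorFlow dataset suitable for training.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="income")

# Train a gradient boosted trees model with default hyperparameters.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

# Print a textual description of the model, including the learned trees
# and variable importances, which supports model interpretation.
model.summary()
```

The model follows the Keras API, so the usual fit, evaluate, and predict workflow applies; trained models can also be exported and served from C++, Go, or JavaScript through the core library.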
