Practical Federated Gradient Boosting Decision Trees

Gradient Boosting Decision Trees (GBDTs) have achieved remarkable success in recent years, winning many machine learning and data mining competitions. Several recent studies have examined how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use: they suffer either from high overhead caused by costly cryptographic primitives such as secret sharing and homomorphic encryption, or from low model accuracy caused by differential privacy designs. We instead study a practical federated environment with relaxed privacy constraints, in which a dishonest party might obtain some information about the other parties' data, but still cannot derive their actual raw records. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure, in that no original record is exposed to other parties, while the computation overhead of training remains low. Our experimental studies show that, compared with training on each party's local data alone, our approach significantly improves predictive accuracy and achieves accuracy comparable to that of the original GBDT trained on the data of all parties.
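To make the similarity step concrete, below is a minimal sketch of the p-stable locality-sensitive hashing scheme (Datar et al., 2004) that the abstract's "similarity information" builds on. This is an illustration of the general LSH technique, not the paper's exact protocol; all function and variable names here are hypothetical. The key property is that each party can hash its records locally and exchange only the integer hash codes, never the raw feature vectors, and records that are close in Euclidean distance collide on a larger fraction of hash functions.

```python
import numpy as np

def make_hash_family(dim, n_hashes, r=4.0, seed=0):
    """p-stable LSH family: h(x) = floor((a . x + b) / r)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_hashes, dim))  # a_i ~ N(0, I); Gaussian is 2-stable
    b = rng.uniform(0.0, r, size=n_hashes)    # random offsets in [0, r)

    def hash_fn(x):
        # Project, shift, and bucket; returns one integer code per hash function.
        return np.floor((a @ x + b) / r).astype(int)

    return hash_fn

def similarity(codes_u, codes_v):
    # Fraction of hash functions on which two records collide; records that
    # are close in L2 distance collide with higher probability.
    return float(np.mean(codes_u == codes_v))

# Each party hashes its own records locally; only the codes are shared.
h = make_hash_family(dim=8, n_hashes=32)
x = np.zeros(8)
near = x + 0.05 * np.ones(8)   # a record very close to x
far = 10.0 * np.ones(8)        # a record far from x

sim_near = similarity(h(x), h(near))
sim_far = similarity(h(x), h(far))
```

Under this scheme, a party can identify which of its local instances are similar to instances held elsewhere (via shared hash codes) and weight gradients accordingly during boosting, without ever seeing another party's raw data.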
