Secure training of decision trees with continuous attributes

Abstract. We apply multiparty computation (MPC) techniques to show how, given a database that is secret-shared among multiple mutually distrustful parties, the parties may obliviously construct a decision tree based on the secret data. We consider data with continuous attributes (i.e., coming from a large domain), and develop a secure version of a learning algorithm similar to the C4.5 or CART algorithms. Previous MPC-based work focused only on decision tree learning with discrete attributes (De Hoogh et al. 2014). Our starting point is to apply an existing generic MPC protocol to a standard decision tree learning algorithm, which we then optimize in several ways. We exploit the fact that even if we allow the data to have continuous values, which a priori might require fixed-point or floating-point representations, the output of the tree learning algorithm depends only on the relative ordering of the data. By obliviously sorting the data we reduce the number of comparisons needed per node from the naive O(N^2) to O(N log^2 N), where N is the number of training records in the dataset, thus making the algorithm feasible for larger datasets. Sorting does, however, introduce a problem when duplicate values occur in the dataset, which we overcome with a relatively cheap subprotocol. We show a procedure to convert a sorting network into a permutation network of smaller complexity, resulting in a round complexity of O(log N) per layer in the tree. We implement our algorithm in the MP-SPDZ framework and benchmark our implementation for both passive and active three-party computation using arithmetic modulo 2^64. We apply our implementation to a large-scale medical dataset of ≈ 290 000 rows using random forests, and thus demonstrate the practical feasibility of using MPC for privacy-preserving machine learning based on decision trees for large datasets.
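The role of sorting and of duplicate values can be illustrated in the clear. The following is a minimal plaintext sketch, not the oblivious protocol of the paper: it assumes binary labels and a Gini-impurity split criterion (as in CART), and the names `gini` and `best_split` are hypothetical. After sorting one attribute, a single scan with running label counts scores every candidate split, and a split is never placed between two equal values. Because the result is expressed as a position in the sorted order, any order-preserving transformation of the attribute yields the same split.

```python
from typing import List, Optional, Tuple


def gini(pos: int, total: int) -> float:
    """Gini impurity of a set with `pos` positive labels out of `total`."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_split(values: List[float], labels: List[int]) -> Optional[Tuple[int, float]]:
    """Return (left_size, weighted_gini) for the best split of one attribute.

    `left_size` is the number of records (in sorted order) sent to the left
    child, so the answer depends only on the relative ordering of `values`.
    Returns None if no valid split exists (e.g. all values are equal).
    """
    n = len(values)
    # Plaintext sort; the paper replaces this step with an oblivious sort.
    order = sorted(range(n), key=lambda i: values[i])
    total_pos = sum(labels)
    best: Optional[Tuple[int, float]] = None
    left_pos = 0
    for k in range(1, n):
        left_pos += labels[order[k - 1]]
        # Duplicate handling: a threshold cannot separate two equal values.
        if values[order[k - 1]] == values[order[k]]:
            continue
        score = (k * gini(left_pos, k)
                 + (n - k) * gini(total_pos - left_pos, n - k)) / n
        if best is None or score < best[1]:
            best = (k, score)
    return best


if __name__ == "__main__":
    values = [2.5, 0.1, 7.0, 2.5, 3.3, 9.9]
    labels = [0, 0, 1, 0, 1, 1]
    print(best_split(values, labels))
    # Cubing is order-preserving, so the chosen split position is unchanged.
    print(best_split([v ** 3 for v in values], labels))
```

In this sketch the duplicate check is a plain equality test on adjacent sorted values; in an MPC setting that test would itself have to be computed on secret-shared data, which is where a dedicated subprotocol becomes necessary.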

[1] Yehuda Lindell, et al. High-Throughput Semi-Honest Secure Three-Party Computation with an Honest Majority, 2016, IACR Cryptol. ePrint Arch.

[2] Ping Deng, et al. Secure Multi-party Protocols for Privacy Preserving Data Mining, 2008, WASA.

[3] Chris Clifton, et al. Privacy-Preserving Decision Trees over Vertically Partitioned Data, 2005, DBSec.

[4] Dan Bogdanov, et al. A Practical Analysis of Oblivious Sorting Algorithms for Secure Multi-party Computation, 2014, NordSec.

[5] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning, 1992.

[6] Tianqing Zhu, et al. An Effective Deferentially Private Data Releasing Algorithm for Decision Tree, 2013, 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.

[7] Abraham Waksman, et al. A Permutation Network, 1968, JACM.

[8] Jan Willemson, et al. Round-Efficient Oblivious Database Manipulation, 2011, ISC.

[9] Philip S. Yu, et al. Classification Spanning Private Databases, 2006, AAAI.

[10] Kenneth E. Batcher, et al. Sorting networks and their applications, 1968, AFIPS Spring Joint Computing Conference.

[11] Assaf Schuster, et al. Data mining with differential privacy, 2010, KDD.

[12] Kai Han, et al. Privacy Preserving ID3 Algorithm over Horizontally Partitioned Data, 2005, Sixth International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT'05).

[13] Yehuda Lindell, et al. Fast Large-Scale Honest-Majority MPC for Malicious Adversaries, 2018, Journal of Cryptology.

[14] Ivan Damgård, et al. Scalable and Unconditionally Secure Multiparty Computation, 2007, CRYPTO.

[15] Marcel Keller, et al. MP-SPDZ: A Versatile Framework for Multi-Party Computation, 2020, IACR Cryptol. ePrint Arch.

[16] Erwan Scornet, et al. Impact of subsampling and pruning on random forests, 2016, arXiv:1603.04261.

[17] Kamal Jethwani, et al. Predictive Modeling of 30-Day Emergency Hospital Transport of Patients Using a Personal Emergency Response System: Prognostic Retrospective Study, 2018, JMIR medical informatics.

[18] Ivan Damgård, et al. SPDℤ_{2^k}: Efficient MPC mod 2^k for Dishonest Majority, 2018, IACR Cryptol. ePrint Arch.

[19] Ali Miri, et al. Privacy preserving ID3 using Gini Index over horizontally partitioned data, 2008, 2008 IEEE/ACS International Conference on Computer Systems and Applications.

[20] Mark Simkin, et al. Use your Brain! Arithmetic 3PC For Any Modulus with Active Security, 2019, IACR Cryptol. ePrint Arch.

[21] Anderson C. A. Nascimento, et al. Efficient and Private Scoring of Decision Trees, Support Vector Machines and Logistic Regression Models Based on Pre-Computation, 2019, IEEE Transactions on Dependable and Secure Computing.

[22] Wenliang Du, et al. Building decision tree classifier on private data, 2002.

[23] Ping Chen, et al. Practical Secure Decision Tree Learning in a Teletreatment Application, 2014, Financial Cryptography.

[24] Kilian Stoffel, et al. Theoretical Comparison between the Gini Index and Information Gain Criteria, 2004, Annals of Mathematics and Artificial Intelligence.

[25] Xu Zhang, et al. Privacy-preserving decision tree for epistasis detection, 2019, Cybersecur.

[26] Marcel Keller, et al. New Primitives for Actively-Secure MPC over Rings with Applications to Private Machine Learning, 2019, 2019 IEEE Symposium on Security and Privacy (SP).

[27] Yehuda Lindell, et al. Privacy Preserving Data Mining, 2002, Journal of Cryptology.