Substra: a framework for privacy-preserving, traceable and collaborative Machine Learning

Machine learning is promising, but it often needs to process vast amounts of sensitive data which raises concerns about privacy. In this white-paper, we introduce Substra, a distributed framework for privacy-preserving, traceable and collaborative Machine Learning. Substra gathers data providers and algorithm designers into a network of nodes that can train models on demand but under advanced permission regimes. To guarantee data privacy, Substra implements distributed learning: the data never leave their nodes; only algorithms, predictive models and non-sensitive metadata are exchanged on the network. The computations are orchestrated by a Distributed Ledger Technology which guarantees traceability and authenticity of information without needing to trust a third party. Although originally developed for Healthcare applications, Substra is not data, algorithm or programming language specific. It supports many types of computation plans including parallel computation plan commonly used in Federated Learning. With appropriate guidelines, it can be deployed for numerous Machine Learning use-cases with data or algorithm providers where trust is limited.

[1]  Satoshi Nakamoto Bitcoin : A Peer-to-Peer Electronic Cash System , 2009 .

[2]  Somesh Jha,et al.  Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures , 2015, CCS.

[3]  Marina Blanton,et al.  Secure Multiparty Computation , 2011, Encyclopedia of Cryptography and Security.

[4]  Blaise Agüera y Arcas,et al.  Federated Learning of Deep Networks using Model Averaging , 2016, ArXiv.

[5]  Agustí Verde Parera,et al.  General data protection regulation , 2018 .

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Amit Sahai,et al.  Secure Multi-Party Computation , 2013 .

[8]  Eliezer Yudkowsky,et al.  The Ethics of Artificial Intelligence , 2014, Artificial Intelligence Safety and Security.

[9]  Michael P. Wellman,et al.  SoK: Security and Privacy in Machine Learning , 2018, 2018 IEEE European Symposium on Security and Privacy (EuroS&P).

[10]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[11]  Blaise Agüera y Arcas,et al.  Communication-Efficient Learning of Deep Networks from Decentralized Data , 2016, AISTATS.

[12]  Daniel Davis Wood,et al.  ETHEREUM: A SECURE DECENTRALISED GENERALISED TRANSACTION LEDGER , 2014 .

[13]  Nils Lid Hjort,et al.  Model Selection and Model Averaging , 2001 .

[14]  Fan Zhang,et al.  Stealing Machine Learning Models via Prediction APIs , 2016, USENIX Security Symposium.

[15]  Bonnie Berger,et al.  Realizing private and practical pharmacological collaboration , 2018, Science.

[16]  Dawn Song,et al.  Keystone: An Open Framework for Architecting TEEs , 2019 .

[17]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[18]  Michael P. Wellman,et al.  Towards the Science of Security and Privacy in Machine Learning , 2016, ArXiv.

[19]  Hubert Eichner,et al.  Towards Federated Learning at Scale: System Design , 2019, SysML.

[20]  Abdelmadjid Bouabdallah,et al.  Trusted Execution Environment: What It is, and What It is Not , 2015, 2015 IEEE Trustcom/BigDataSE/ISPA.

[21]  Somesh Jha,et al.  Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting , 2017, 2018 IEEE 31st Computer Security Foundations Symposium (CSF).

[22]  Marko Vukolic,et al.  Hyperledger fabric: a distributed operating system for permissioned blockchains , 2018, EuroSys.

[23]  Sarvar Patel,et al.  Practical Secure Aggregation for Privacy-Preserving Machine Learning , 2017, IACR Cryptol. ePrint Arch..