On the Privacy of Federated Pipelines

Federated learning (FL) is an increasingly popular machine learning paradigm for application scenarios in which sensitive data held at different local sites cannot be shared because of privacy regulations. In FL, the sensitive data never leaves the local sites; only model parameters are shared with a global aggregator. Nonetheless, it has recently been shown that, under some circumstances, the private data can be reconstructed from the model parameters, which implies that data leakage can occur in FL. In this paper, we draw attention to another risk associated with FL: even if federated algorithms are individually privacy-preserving, combining them into pipelines is not necessarily privacy-preserving. We provide a concrete example from genome-wide association studies, where the combination of federated principal component analysis and federated linear regression allows the aggregator to retrieve sensitive patient data by solving an instance of the multidimensional subset sum problem. This supports the growing awareness in the field that, for FL to be truly privacy-preserving, measures must be taken to protect against data leakage at the aggregator.
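
To make the multidimensional subset sum problem mentioned above concrete, the following is a minimal Python sketch of the generic problem: given a list of integer vectors and a target vector, find a subset of the vectors whose component-wise sum equals the target. It is not the attack described in the paper; the abstract does not say how the instance is built from the PCA and regression outputs. The function name `solve_mss`, the brute-force search, and the toy numbers are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

Vector = Tuple[int, ...]


def solve_mss(vectors: Sequence[Vector], target: Vector) -> Optional[Tuple[int, ...]]:
    """Return indices of a subset of `vectors` summing component-wise to `target`.

    Hypothetical helper: exhaustive search over all subsets, smallest first.
    Runtime is exponential in len(vectors), so this only suits toy instances.
    """
    dim = len(target)
    for size in range(len(vectors) + 1):
        for idx in combinations(range(len(vectors)), size):
            total = tuple(sum(vectors[i][d] for i in idx) for d in range(dim))
            if total == target:
                return idx
    return None  # no subset matches the target


# Toy instance with made-up numbers (not data from the paper).
vectors = [(1, 0, 2), (0, 1, 1), (2, 2, 0), (1, 1, 1)]
target = (3, 2, 2)
print(solve_mss(vectors, target))  # (0, 2): (1, 0, 2) + (2, 2, 0) == (3, 2, 2)
```

The exhaustive search is chosen only for readability. A realistic instance would need a far more efficient exact or approximate solver.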
