Practical Lossless Federated Singular Vector Decomposition over Billion-Scale Data

With the enactment of privacy-preserving regulations such as GDPR, federated SVD has been proposed to enable SVD-based applications over different data sources without revealing the original data. However, existing federated SVD solutions support many SVD-based applications poorly. The crux is that these solutions rely on either differential privacy (DP) or homomorphic encryption (HE), and therefore suffer from accuracy loss caused by unremovable noise or from degraded efficiency due to inflated data. In this paper, we propose FedSVD, a practical lossless federated SVD method over billion-scale data that achieves lossless accuracy and high efficiency simultaneously. At the heart of FedSVD is a lossless matrix masking scheme carefully designed for SVD: 1) although the masks are applied to protect private data, FedSVD removes them completely from the final SVD results, so accuracy is lossless; and 2) because the masks do not inflate the data, FedSVD avoids extra computation and communication overhead during the factorization, so efficiency stays high. Experiments with real-world datasets show that FedSVD is over 10000x faster than the HE-based method and has an error 10 orders of magnitude smaller than the DP-based solution (ε=0.1, δ=0.1) on SVD tasks. We further build and evaluate FedSVD on three real-world applications, principal component analysis (PCA), linear regression (LR), and latent semantic analysis (LSA), to show its superior performance in practice. On federated LR tasks, compared with two state-of-the-art solutions, FATE [21] and SecureML [12], FedSVD-LR is 100x faster than SecureML and 10x faster than FATE.
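The idea of a mask that can be removed exactly from the SVD result can be illustrated with a small single-process sketch. The abstract does not state the form of the masks, so the code below assumes random orthogonal masks `P` and `Q` applied on both sides of the data matrix; the helper names `random_orthogonal` and `masked_svd` are hypothetical, and no federated communication or key distribution is modeled. Under that assumption, the masked matrix `P @ X @ Q` has the same shape and the same singular values as `X`, and the singular vectors of `X` are recovered by multiplying with the transposed masks.

```python
import numpy as np


def random_orthogonal(n, rng):
    # Assumption: masks are random orthogonal matrices, obtained here
    # as the Q factor of the QR factorization of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def masked_svd(x, rng):
    """Factorize a masked copy of x, then strip the masks from the result."""
    m, n = x.shape
    p = random_orthogonal(m, rng)  # left mask, m x m
    q = random_orthogonal(n, rng)  # right mask, n x n

    # The party doing the factorization only sees the masked matrix,
    # which has the same shape as x (no data inflation).
    x_masked = p @ x @ q
    u_m, s, vt_m = np.linalg.svd(x_masked, full_matrices=False)

    # X' = P X Q = U' S V'^T  implies  X = (P^T U') S (V'^T Q^T).
    u = p.T @ u_m
    vt = vt_m @ q.T
    return u, s, vt


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
u, s, vt = masked_svd(x, rng)

# Compare against a plain SVD of the unmasked data.
s_plain = np.linalg.svd(x, compute_uv=False)
recon_err = np.max(np.abs(u @ np.diag(s) @ vt - x))
sv_err = np.max(np.abs(s - s_plain))
```

In this sketch both `recon_err` and `sv_err` are at the level of floating-point round-off, which is what "lossless" means here: unlike DP noise, the mask contributes no residual error once it is removed.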

[1] Han Tian, et al. Sphinx: Enabling Privacy-Preserving Online Learning over the Cloud, 2022, 2022 IEEE Symposium on Security and Privacy (SP).

[2] Yuanman Li, et al. Secure and Verifiable Outsourcing of Large-Scale Nonnegative Matrix Factorization (NMF), 2021, IEEE Transactions on Services Computing.

[3] Xinchen Wan, et al. TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks, 2021, ArXiv.

[4] Bowen Liu, et al. Privacy-Preserving Decentralised Singular Value Decomposition, 2019, IACR Cryptol. ePrint Arch.

[5] Yang Qiang, et al. Federated Recommendation Systems, 2019, 2019 IEEE International Conference on Big Data (Big Data).

[6] Jon Crowcroft, et al. Federated Principal Component Analysis, 2019, NeurIPS.

[7] Kai Chen, et al. Secure Federated Matrix Factorization, 2019, IEEE Intelligent Systems.

[8] Rui Li, et al. Insecurity and Hardness of Nearest Neighbor Queries Over Encrypted Data, 2019, 2019 IEEE 35th International Conference on Data Engineering (ICDE).

[9] Qiang Yang, et al. Federated Machine Learning, 2019, ACM Trans. Intell. Syst. Technol.

[10] Changqing Luo, et al. SecFact: Secure Large-scale QR and LU Factorizations, 2017, IEEE Transactions on Big Data.

[11] Sarvar Patel, et al. Practical Secure Aggregation for Privacy-Preserving Machine Learning, 2017, IACR Cryptol. ePrint Arch.

[12] Payman Mohassel, et al. SecureML: A System for Scalable Privacy-Preserving Machine Learning, 2017, 2017 IEEE Symposium on Security and Privacy (SP).

[13] F. Maxwell Harper, et al. The MovieLens Datasets: History and Context, 2016, TIIS.

[14] Roksana Boreli, et al. Applying Differential Privacy to Matrix Factorization, 2015, RecSys.

[15] Nilanjan Chatterjee, et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies, 2013, Nature Genetics.

[16] Philip S. Yu, et al. Privacy-Preserving Singular Value Decomposition, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[17] Kemal Polat, et al. Medical diagnosis of atherosclerosis from Carotid Artery Doppler Signals using principal component analysis (PCA), k-NN based weighting pre-processing and Artificial Immune Recognition System (AIRS), 2008, J. Biomed. Informatics.

[18] D. Reich, et al. Principal components analysis corrects for stratification in genome-wide association studies, 2006, Nature Genetics.

[19] A. Rukhin. Matrix Variate Distributions, 1999, The Multivariate Normal Distribution.

[20] G. Stewart, et al. Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, 1976.

[21] Qiang Yang, et al. FATE: An Industrial Grade Platform for Collaborative Learning With Data Protection, 2021, J. Mach. Learn. Res.

[22] Sheng Zhong, et al. Secure and Efficient Outsourcing of PCA-Based Face Recognition, 2020, IEEE Transactions on Information Forensics and Security.

[23] Aditya Bhaskara, et al. On Distributed Averaging for Stochastic k-PCA, 2019, NeurIPS.

[24] Paul Voigt, et al. The EU General Data Protection Regulation (GDPR), 2017.

[25] Parinya Sanguansat, et al. Principal Component Analysis: Engineering Applications, 2014.

[26] Peter Wiemer-Hastings, et al. Latent semantic analysis, 2004, Annu. Rev. Inf. Sci. Technol.

[27] Simon Haykin, et al. Gradient-Based Learning Applied to Document Recognition, 2001.

[28] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.