Enabling SQL-based Training Data Debugging for Federated Learning

How can we debug a logistic regression model in a federated learning setting when the model behaves unexpectedly (e.g., it rejects all high-income customers’ loan applications)? The SQL-based training data debugging framework has proved effective at fixing this kind of issue in a non-federated learning setting. Given an unexpected query result over model predictions, this framework automatically removes label errors from the training data so that the unexpected behavior disappears in the retrained model. In this paper, we enable this powerful framework for federated learning. The key challenge is to develop a security protocol for federated debugging that can be proved secure, efficient, and accurate. Achieving this goal requires seamlessly integrating techniques from multiple fields (Databases, Machine Learning, and Cybersecurity). We first propose FedRain, which extends Rain, the state-of-the-art SQL-based training data debugging framework, to our federated learning setting. We address several technical challenges to make FedRain work and analyze its security guarantees and time complexity. The analysis shows that FedRain falls short in both efficiency and security. To overcome these limitations, we redesign our security protocol and propose Frog, a novel SQL-based training data debugging framework tailored for federated learning. Our theoretical analysis shows that Frog is more secure, more accurate, and more efficient than FedRain. We conduct extensive experiments on several real-world datasets and a case study. The experimental results are consistent with our theoretical analysis and validate the effectiveness of Frog in practice.
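
To make the debugging loop concrete, the sketch below shows a plaintext, non-federated toy version of the idea: a complaint on a query result, a ranking of training rows by their estimated effect on that result, removal of the top-ranked rows, and retraining. It is an illustration only, not the FedRain or Frog protocol: there is no encryption or multi-party computation, the influence-style ranking is a simplification assumed here, and the dataset, the function names (`train`, `approved_count`, `rank_by_influence`), and the removal budget of six rows are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # x = [bias, high_income]; returns the predicted approval probability.
    return sigmoid(theta[0] * x[0] + theta[1] * x[1])

def train(X, y, l2=0.01, lr=0.5, steps=2000):
    """Fit L2-regularized logistic regression with plain gradient descent."""
    theta = [0.0, 0.0]
    n = len(y)
    for _ in range(steps):
        grad = [l2 * theta[0], l2 * theta[1]]
        for x, label in zip(X, y):
            err = predict(theta, x) - label
            grad[0] += err * x[0] / n
            grad[1] += err * x[1] / n
        theta = [theta[0] - lr * grad[0], theta[1] - lr * grad[1]]
    return theta

def approved_count(theta, rows):
    """Stand-in for: SELECT COUNT(*) FROM rows WHERE predict(row) = 'approve'."""
    return sum(1 for x in rows if predict(theta, x) > 0.5)

def rank_by_influence(theta, X, y, query_rows, l2=0.01):
    """Rank training rows by how much removing each is estimated to raise the query."""
    n = len(y)
    # Hessian of the regularized training loss (2x2, symmetric).
    h00 = h01 = h11 = 0.0
    for x in X:
        p = predict(theta, x)
        w = p * (1 - p) / n
        h00 += w * x[0] * x[0]
        h01 += w * x[0] * x[1]
        h11 += w * x[1] * x[1]
    h00 += l2
    h11 += l2
    # Gradient of the relaxed query: sum of approval probabilities over query rows.
    g = [0.0, 0.0]
    for x in query_rows:
        p = predict(theta, x)
        g[0] += p * (1 - p) * x[0]
        g[1] += p * (1 - p) * x[1]
    # v = H^{-1} g, solved in closed form for the 2x2 case.
    det = h00 * h11 - h01 * h01
    v = [(h11 * g[0] - h01 * g[1]) / det, (h00 * g[1] - h01 * g[0]) / det]
    # First-order score: removing row i shifts the query by about score_i / n.
    scores = [(predict(theta, x) - label) * (v[0] * x[0] + v[1] * x[1])
              for x, label in zip(X, y)]
    return sorted(range(n), key=lambda i: -scores[i])

# Hypothetical training data: rows 0-9 are low income, rows 10-17 are high income.
# Rows 12-17 carry (assumed) wrong "reject" labels.
X = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 8
y = [1] * 3 + [0] * 7 + [1] * 2 + [0] * 6
query_rows = [[1.0, 1.0]] * 4  # four high-income applicants being scored

theta = train(X, y)
before = approved_count(theta, query_rows)  # the complaint: this count is too low

removed = sorted(rank_by_influence(theta, X, y, query_rows)[:6])
keep = [i for i in range(len(y)) if i not in removed]
theta_fixed = train([X[i] for i in keep], [y[i] for i in keep])
after = approved_count(theta_fixed, query_rows)

print(before, removed, after)
```

On this toy data the ranking singles out the high-income rows labelled "reject", and the retrained model approves the queried high-income applicants. In the federated setting the abstract describes, the features and labels behind such a computation are held by different parties, which is why a secure protocol is needed in place of the direct computation shown here.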
