Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning

We introduce a library, Dataset Grouper, to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library allows the creation of group-structured versions of existing datasets based on user-specified partitions, and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper allows for large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation.
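To make the partitioning idea concrete, here is a minimal, framework-agnostic sketch of group-structured partitioning. This is not Dataset Grouper's actual API; the `partition_by_group` helper and the `get_group_id` callback are illustrative assumptions standing in for a user-specified partition function.

```python
# A minimal sketch of group-structured partitioning, independent of
# Dataset Grouper's real API. `partition_by_group` and `get_group_id`
# are hypothetical names used only for illustration.
import collections
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def partition_by_group(
    examples: Iterable[dict],
    get_group_id: Callable[[dict], str],
) -> Iterator[Tuple[str, dict]]:
    """Tags each example with its group id, per a user-specified partition."""
    for example in examples:
        yield get_group_id(example), example


# Toy usage: partition text records by author, the kind of natural group
# structure a federated language-modeling simulation would use.
records = [
    {"author": "alice", "text": "hello world"},
    {"author": "bob", "text": "federated learning"},
    {"author": "alice", "text": "foundation models"},
]
groups: Dict[str, List[dict]] = collections.defaultdict(list)
for group_id, example in partition_by_group(records, lambda ex: ex["author"]):
    groups[group_id].append(example)

print({k: len(v) for k, v in groups.items()})  # {'alice': 2, 'bob': 1}
```

In a scalable pipeline, each group would be streamed to its own shard on disk (e.g., via a distributed data-processing framework such as Apache Beam) rather than accumulated in a dictionary, so that no single group's dataset has to fit in memory.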
