Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.
