Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Christopher D. Manning | Daniel E. Ho | Dan Jurafsky | M. Krass | Peter Henderson | Neel Guha | Lucia Zheng
[1] Board , 2023, Médecine des Maladies Métaboliques.
[2] Xi Victoria Lin,et al. Lifting the Curse of Multilinguality by Pre-training Modular Transformers , 2022, NAACL.
[3] Dragomir R. Radev,et al. Data Governance in the Age of Large-Scale Data-Driven Language Technology , 2022, FAccT.
[4] Xi Victoria Lin,et al. OPT: Open Pre-trained Transformer Language Models , 2022, ArXiv.
[5] Dipankar Ray,et al. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection , 2022, ACL.
[6] Jooyoung Lee,et al. Do Language Models Plagiarize? , 2022, WWW.
[7] Florian Tramèr,et al. Quantifying Memorization Across Neural Language Models , 2022, ICLR.
[8] Colin Raffel,et al. Deduplicating Training Data Mitigates Privacy Risks in Language Models , 2022, ICML.
[9] Florian Tramèr,et al. What Does it Mean for a Language Model to Preserve Privacy? , 2022, FAccT.
[10] Noah A. Smith,et al. Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection , 2022, EMNLP.
[11] Po-Sen Huang,et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher , 2021, ArXiv.
[12] Arkaitz Zubiaga,et al. Abusive language detection in youtube comments leveraging replies as conversational context , 2021, PeerJ Comput. Sci..
[13] Behnam Neyshabur,et al. Exploring the Limits of Large Scale Pre-training , 2021, ICLR.
[14] D. Katz,et al. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English , 2021, ACL.
[15] Michael S. Bernstein,et al. On the Opportunities and Risks of Foundation Models , 2021, ArXiv.
[16] Allison Hegel,et al. The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues , 2021, ArXiv.
[17] Matthias Grabmair,et al. Context-aware legal citation recommendation using deep learning , 2021, ICAIL.
[18] Sherman S. M. Chow,et al. Differential Privacy for Text Analytics via Natural Text Sanitization , 2021, FINDINGS.
[19] Paolo Torroni,et al. Detecting and explaining unfairness in consumer contracts through memory networks , 2021, Artificial Intelligence and Law.
[20] Praveen K. Paritosh,et al. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI , 2021, CHI.
[21] Daniel E. Ho,et al. Executive Control of Agency Adjudication: Capacity, Selection and Precedential Rulemaking , 2021 .
[22] Jesse Dodge,et al. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus , 2021, EMNLP.
[23] Daniel E. Ho,et al. When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings , 2021, ICAIL.
[24] Dan Hendrycks,et al. CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review , 2021, NeurIPS Datasets and Benchmarks.
[25] Emily M. Bender,et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 , 2021, FAccT.
[26] Arkaitz Zubiaga,et al. Towards generalisable hate speech detection: a review on obstacles and solutions , 2021, PeerJ Comput. Sci..
[27] Satyapriya Krishna,et al. ADePT: Auto-encoder based Differentially Private Text Transformation , 2021, EACL.
[28] Charles Foster,et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling , 2020, ArXiv.
[29] Tom B. Brown,et al. Extracting Training Data from Large Language Models , 2020, USENIX Security Symposium.
[30] Abhinav Aggarwal,et al. Research Challenges in Designing Differentially Private Text Generation Mechanisms , 2020, FLAIRS.
[31] Evangelos Kanoulas,et al. A Benchmark for Lease Contract Review , 2020, ArXiv.
[32] B. Botz,et al. Information leakage , 2020, Radiopaedia.org.
[33] Łukasz Borchmann,et al. Contract Discovery: Dataset and a Few-shot Semantic Retrieval Challenge with Competitive Baselines , 2020, FINDINGS.
[34] Lingjuan Lyu,et al. Differentially Private Representation for NLP: Formal Guarantee and An Empirical Study on Privacy and Fairness , 2020, FINDINGS.
[35] Margaret E. Roberts,et al. Mass Digitization of Chinese Court Decisions , 2020, Journal of Law and Courts.
[36] Tom B. Brown,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[37] A. Butte,et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes , 2020, npj Digital Medicine.
[38] Congzheng Song,et al. Information Leakage in Embedding Models , 2020, CCS.
[39] Dana Burchardt. Backlash against the Court of Justice of the EU? The Recent Jurisprudence of the German Constitutional Court on EU Fundamental Rights as a Standard of Review , 2020, German Law Journal.
[40] Tomohide Shibata. Understand It in 5 Minutes!? A Quick Read of Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .
[41] Peter F. Edemekong,et al. Health Insurance Portability and Accountability Act , 2020, Definitions.
[42] Joelle Pineau,et al. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning , 2020, ArXiv.
[43] Mark A. Lemley,et al. Fair Learning , 2020, SSRN Electronic Journal.
[44] Jeremy Blackburn,et al. The Pushshift Reddit Dataset , 2020, ICWSM.
[45] A. Sanchís,et al. Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[46] Vishrav Chaudhary,et al. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data , 2019, LREC.
[47] Emily Ahn,et al. Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts , 2019, EMNLP.
[48] Davis Liang,et al. Masked Language Model Scoring , 2019, ACL.
[49] Peter J. Liu,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[50] Alexandre Lacoste,et al. Quantifying the Carbon Emissions of Machine Learning , 2019, ArXiv.
[51] Thomas Wolf,et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , 2019, ArXiv.
[52] A. Adams,et al. AnonyMate: A Toolkit for Anonymizing Unstructured Chat Data , 2019 .
[53] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[54] Ion Androutsopoulos,et al. Large-Scale Multi-Label Text Classification on EU Legislation , 2019, ACL.
[55] Vitaly Shmatikov,et al. Differential Privacy Has Disparate Impact on Model Accuracy , 2019, NeurIPS.
[56] Steven Ruggles,et al. Differential Privacy and Census Data: Implications for Social and Economic Research , 2019, AEA Papers and Proceedings.
[57] Tatishe M. Nteta,et al. Racial bias in legal language , 2019, Research & Politics.
[58] Alexandra Chouldechova,et al. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting , 2019, FAT.
[59] Lucy Vasserman,et al. Measuring and Mitigating Unintended Bias in Text Classification , 2018, AIES.
[60] H. Brendan McMahan,et al. A General Approach to Adding Differential Privacy to Iterative Training Procedures , 2018, ArXiv.
[61] Hubert Eichner,et al. Federated Learning for Mobile Keyboard Prediction , 2018, ArXiv.
[62] Inioluwa Deborah Raji,et al. Model Cards for Model Reporting , 2018, FAT.
[63] Ralf Krestel,et al. Challenges for Toxic Comment Classification: An In-Depth Error Analysis , 2018, ALW.
[64] Shashi Narayan,et al. Privacy-preserving Neural Representations of Text , 2018, EMNLP.
[65] Nancy Morawetz. A Better Balance for Federal Rules Governing Public Access to Appeal Records in Immigration Cases , 2018 .
[66] I. Glenn Cohen,et al. HIPAA and Protecting Health Information in the 21st Century , 2018, JAMA.
[67] Daniel Martin Katz,et al. LexNLP: Natural Language Processing and Information Extraction For Legal and Regulatory Texts , 2018, Research Handbook on Big Data Law.
[68] Daniel Jurafsky,et al. Deconfounded Lexicon Induction for Interpretable Social Science , 2018, NAACL.
[69] Vincent N. Schiraldi,et al. Youth Justice in Europe: Experience of Germany, the Netherlands, and Croatia in Providing Developmentally Appropriate Responses to Emerging Adults in the Criminal Justice System , 2018 .
[70] Timothy Baldwin,et al. Towards Robust and Privacy-preserving Text Representations , 2018, ACL.
[71] Paolo Torroni,et al. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service , 2018, Artificial Intelligence and Law.
[72] Úlfar Erlingsson,et al. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks , 2018, USENIX Security Symposium.
[73] Mauro Conti,et al. All You Need is "Love": Evading Hate Speech Detection , 2018, ArXiv.
[74] R. Sundstrom. On Post-racialism , 2017 .
[75] P. Weiler. Canadian Judicial Council , 2017 .
[76] Slava J. Mikhaylov,et al. Understanding state preferences with text as data: Introducing the UN General Debate corpus , 2017, ArXiv.
[77] Joel R. Tetreault,et al. Abusive Language Detection in Online User Content , 2016, WWW.
[78] K. Hioki,et al. Judging Implicit Bias: A National Empirical Study Of Judicial Stereotypes , 2016 .
[79] Vitaly Shmatikov,et al. Privacy-preserving deep learning , 2015, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).
[80] Daniel P. Miranker,et al. Constitute: The world's constitutions to read, search, and compare , 2014, J. Web Semant..
[81] U. Congress. Chapter 25 - National Historical Publications and Records Commission , 2014 .
[82] Elaine Craig. The Ethical Obligations of Defence Counsel in Sexual Assault Cases , 2013, Osgoode Hall Law Journal.
[83] Arman Sarvarian. Common Ethical Standards for Counsel before the European Court of Justice and European Court of Human Rights , 2012 .
[84] David S. Law,et al. The Declining Influence of the United States Constitution , 2012 .
[85] Evan J. Mandery,et al. The Expungement Myth , 2012 .
[86] A. Roland. The Supreme Court of Canada , 2007 .
[87] Michael Roe,et al. Scanning electronic documents for personally identifiable information , 2006, WPES '06.
[88] Gerald L. Lohse,et al. International Differences in Information Privacy Concerns: A Global Survey of Consumers , 2004, Inf. Soc..
[89] E. Shalev. Ancient Masks, American Fathers: Classical Pseudonyms during the American Revolution and Early Republic , 2003 .
[90] M. Tushnet. The Warren Court in Historical and Political Perspective , 1993 .
[91] H. Hai. United States Court of Appeals: For the Second Circuit , 1962, International Legal Materials.
[92] I. Kivlichan,et al. Capturing Covertly Toxic Speech via Crowdsourcing , 2021, HCINLP.
[93] P. Quaresma,et al. ECHR: Legal Corpus for Argument Mining , 2020, ARGMINING.
[94] Janette C. Brown,et al. Social Security Administration , 2020, Federal Regulatory Guide.
[95] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[96] P. Prasad. Implicit Racial Biases in Prosecutorial Summations: Proposing an Integrated Response , 2018 .
[97] N. Pace,et al. Provider Fraud in California Workers' Compensation: Selected Issues , 2017 .
[98] Alan Garfield. To Swear or Not to Swear: Using Foul Language During a Supreme Court Oral Argument , 2012 .
[99] John H. Blume,et al. Racial Epithets in the Criminal Process , 2011 .
[100] L. Strahilevitz. Pseudonymous Litigation , 2010 .
[101] Michael Lissner,et al. CourtListener.com: A platform for researching and staying abreast of the latest in the law , 2010 .
[102] Johannes Fürnkranz,et al. An Evaluation of Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain , 2007, LWA.
[103] Philipp Koehn,et al. Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.
[104] M. Berry. The Pig Farmer's Daughter and Other Tales of American Justice: Episodes of Racism and Sexism in the Courts from 1865 to the Present , 1999 .
[105] R. Clarke. All you need is love. , 1996, Modern midwife.
[106] Naomi R. Cahn,et al. Looseness of Legal Language: The Reasonable Woman Standard in Theory and in Practice , 1992 .
[107] D. Sperber,et al. Irony and the Use-Mention Distinction , 1981 .
[108] C. Sargent. Report of the Director of the Arnold Arboretum, presented to the President and Fellows of Harvard University , 1874, Bulletin of the Bussey Institution.