CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review

Many specialized domains remain untouched by deep learning because large labeled datasets require expensive expert annotators. We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The task is to highlight the salient portions of a contract that are important for a human to review. We find that Transformer models achieve only nascent performance, and that this performance is strongly influenced by model design and training dataset size. While these results are promising, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community.
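The abstract does not spell out the modeling setup, but the task of highlighting salient contract clauses is naturally framed as SQuAD-style extractive question answering, where each label category becomes a question asked of the contract text. The sketch below assumes that framing and uses the HuggingFace Transformers question-answering pipeline; the checkpoint name, question wording, and contract snippet are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: contract review framed as extractive QA (assumed framing).
# The model checkpoint and example text are placeholders, not CUAD's official setup.
from transformers import pipeline

# Any extractive-QA checkpoint can be dropped in; in practice one would
# fine-tune on CUAD annotations before using the model for review.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

contract = (
    "This Agreement shall commence on January 1, 2021 and shall continue "
    "for a period of two (2) years unless terminated earlier in accordance "
    "with Section 9."
)

# Each CUAD label category can be phrased as a question about the contract.
question = "What is the expiration date or term of the contract?"

result = qa(question=question, context=contract)
print(result["answer"], result["score"])
```

In a full pipeline, this query would be repeated for every label category over every contract, and spans below a confidence threshold would be treated as "no clause present," mirroring the unanswerable-question setting of SQuAD 2.0.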
