Multi-label legal document classification: A deep learning-based approach with label-attention and domain-specific pre-training

Abstract Multi-label document classification has a broad range of practical applications, such as news article topic tagging, sentiment analysis, and medical code classification. A variety of approaches (e.g., tree-based methods, neural networks, and deep learning systems built on pre-trained language models) have been developed for multi-label document classification and have achieved satisfactory performance on different datasets. In the legal domain, however, multi-label classification tasks present several key challenges. One critical challenge is the lack of high-quality human-labeled datasets, which prevents researchers and practitioners from achieving decent performance on the respective tasks. In addition, existing multi-label classification methods typically focus on the majority classes, resulting in unsatisfactory performance on other important classes that lack sufficient training samples. To tackle these challenges, we first present POSTURE50K, a novel legal extreme multi-label classification dataset that we will release to the research community. The dataset contains 50,000 legal opinions together with their manually labeled legal procedural postures. Labels in this dataset follow a Zipfian distribution, leaving many classes with only a few samples. We further propose a deep learning architecture that combines domain-specific pre-training with a label-attention mechanism for multi-label document classification. We evaluate the proposed architecture on POSTURE50K and on another legal multi-label dataset, EURLEX57K, and show that it outperforms two baseline systems and four other recent state-of-the-art methods on both datasets.
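
The label-attention idea mentioned in the abstract, attending over token representations from a (domain-pretrained) encoder with one learned query per label, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the class name LabelAttention, the per-label query parameters, and all layer shapes are chosen here purely for exposition.

```python
import torch
import torch.nn as nn


class LabelAttention(nn.Module):
    """Minimal label-attention head for multi-label classification.

    Illustrative sketch only: layer sizes and the per-label scoring scheme
    are assumptions, not the design described in the paper.
    """

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        # One learned query vector per label.
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_size))
        # Per-label binary classifier applied to each attended representation.
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden) from an encoder such as a
        # transformer further pre-trained on in-domain (e.g., legal) text.
        # Attention score of every label query against every token.
        scores = torch.einsum("lh,bth->blt", self.label_queries, token_states)
        weights = torch.softmax(scores, dim=-1)              # (batch, labels, seq_len)
        # Label-specific document representations.
        label_docs = torch.einsum("blt,bth->blh", weights, token_states)
        # Score label l with the l-th classifier row, then apply a sigmoid
        # so each label is an independent binary decision.
        logits = (label_docs * self.classifier.weight.unsqueeze(0)).sum(-1)
        logits = logits + self.classifier.bias
        return torch.sigmoid(logits)                          # (batch, num_labels)
```

In use, the sigmoid outputs would typically be thresholded (e.g., at 0.5) to produce the predicted label set for each document; the threshold and encoder choice are likewise assumptions of this sketch.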
