Exploiting EuroVoc's Hierarchical Structure for Classifying Legal Documents

Multi-label document classification is challenging because the number of classes can be very large. Furthermore, real-world datasets often exhibit a strongly varying number of labels per document and a power-law distribution of class labels. Multi-label classification of legal documents is further complicated by long document texts and domain-specific language. In this paper, we compare the performance of several text classification approaches on existing datasets and corpora of legal documents, and contrast the results with those obtained on general-purpose multi-label text classification datasets. Moreover, for the EUR-Lex legal datasets, we show that exploiting the hierarchy of the EuroVoc thesaurus improves classification performance by reducing the number of potential classes while retaining the informative value of the classification itself.
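One way to picture how a thesaurus hierarchy can reduce the number of classes is to replace each assigned descriptor with its ancestor at a chosen level. The abstract does not say how the reduction is carried out, so the following is only a minimal sketch under stated assumptions: the `PARENT` mapping is a hypothetical toy hierarchy (not actual EuroVoc data), each label is assumed to have a single parent, and the function names `path_from_root` and `roll_up` are illustrative.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy (NOT real EuroVoc data): child -> parent.
# None marks a top-level label. Single-parent structure is an assumption.
PARENT: Dict[str, Optional[str]] = {
    "law": None,
    "agriculture": None,
    "civil law": "law",
    "criminal law": "law",
    "contract law": "civil law",
    "farming systems": "agriculture",
    "organic farming": "farming systems",
}


def path_from_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of labels from the top level down to `label`."""
    path = [label]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def roll_up(labels: Set[str], parent: Dict[str, Optional[str]], depth: int) -> Set[str]:
    """Map each label to its ancestor at `depth` (0 = top level).

    Labels that already sit above `depth` are kept unchanged.
    """
    reduced = set()
    for label in labels:
        path = path_from_root(label, parent)
        reduced.add(path[min(depth, len(path) - 1)])
    return reduced


doc_labels = {"contract law", "criminal law", "organic farming"}
top_level = roll_up(doc_labels, PARENT, depth=0)
second_level = roll_up(doc_labels, PARENT, depth=1)
```

In this toy setting, three fine-grained labels collapse into two top-level classes, while rolling up to the second level keeps more detail. Choosing the depth trades the size of the label space against how specific the remaining classes are.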
