Is linguistic information relevant for the classification of legal texts?

Text classification is an important task in the legal domain. Most legal information is stored as text in a largely unstructured format, so it is important to be able to classify these texts automatically into a predefined set of concepts. Support Vector Machines (SVMs), a machine learning algorithm, have been shown to be good classifiers for text collections [12]. In this paper, SVMs are applied to the classification of European Portuguese legal texts (the Portuguese Attorney General's Office Decisions), and the relevance of linguistic information in this domain, namely lemmatisation and part-of-speech tags, is evaluated. The results show that some linguistic information (namely, lemmatisation and part-of-speech tags) can be successfully used to improve the classification results and, simultaneously, to decrease the number of features needed by the learning algorithm.
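To make the feature-reduction claim concrete, the following is a minimal sketch of how lemmatisation and part-of-speech filtering can shrink a bag-of-words representation before it is handed to a classifier such as an SVM. It is not the authors' pipeline: the token triples, the tag names, the English example words, and the helper `bag_of_words` are all assumptions made for illustration, and a real system would obtain lemmas and tags from a Portuguese tagger.

```python
from collections import Counter

# Hypothetical pre-tagged tokens: (surface form, lemma, part-of-speech tag).
# These values are illustrative only; in practice they would come from a
# lemmatiser and tagger run over the legal texts.
DOC = [
    ("the", "the", "DET"),
    ("courts", "court", "NOUN"),
    ("decided", "decide", "VERB"),
    ("that", "that", "CONJ"),
    ("the", "the", "DET"),
    ("court", "court", "NOUN"),
    ("decides", "decide", "VERB"),
    ("pensions", "pension", "NOUN"),
]


def bag_of_words(tokens, use_lemma=False, keep_pos=None):
    """Count features for one document.

    use_lemma: count lemmas instead of surface forms.
    keep_pos:  if given, keep only tokens whose tag is in this set.
    """
    counts = Counter()
    for surface, lemma, pos in tokens:
        if keep_pos is not None and pos not in keep_pos:
            continue  # drop tokens with an unwanted part of speech
        counts[lemma if use_lemma else surface] += 1
    return counts


# Three representations of the same document, from richest to most compact.
surface_features = bag_of_words(DOC)
lemma_features = bag_of_words(DOC, use_lemma=True)
filtered_features = bag_of_words(DOC, use_lemma=True, keep_pos={"NOUN", "VERB"})

print(len(surface_features), len(lemma_features), len(filtered_features))
```

In this toy document, lemmatisation merges inflected variants into a single feature, and the part-of-speech filter then removes function words, so each step leaves fewer distinct features for the learning algorithm to handle.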

[1] Paulo Quaresma, et al. PGR: Portuguese Attorney General's Office Decisions on the Web, 2001, INAP.

[2] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[3] Artificial neural networks and legal categorization, 2003.

[4] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[5] Richard M. Tong, et al. Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment), 1993, TREC.

[6] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[7] Thorsten Joachims, et al. Learning to classify text using support vector machines - methods, theory and algorithms, 2002, The Kluwer international series in engineering and computer science.

[8] Chao-Lin Liu, et al. Classification and clustering for case-based criminal summary judgments, 2003, ICAIL.

[9] Paul Thompson. Automatic categorization of case law, 2001, ICAIL '01.

[10] Eckhard Bick. A Constraint Grammar Based Question Answering System for Portuguese, 2003, EPIA.

[11] Dawn Wilkins, et al. The effectiveness of machine learning techniques for predicting time to case disposition, 1997, ICAIL '97.

[12] Kevin D. Ashley, et al. Finding factors: learning to classify case opinions under abstract fact categories, 1997, ICAIL '97.

[13] Hinrich Schütze, et al. A comparison of classifiers and document representations for the routing problem, 1995, SIGIR '95.

[14] Teresa Gonçalves, et al. A Preliminary Approach to the Multilabel Classification Problem of Portuguese Juridical Documents, 2003, EPIA.

[15] Nello Cristianini, et al. Kernel Methods for Pattern Analysis, 2004.

[16] Dunja Mladenic, et al. Feature Selection for Unbalanced Class Distribution and Naive Bayes, 1999, ICML.

[18] Andrew Stranieri, et al. The split-up system: integrating neural networks and rule-based reasoning in the legal domain, 1995, ICAIL '95.

[19] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[20] Nathalie Japkowicz, et al. The Class Imbalance Problem: Significance and Strategies, 2000.

[21] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[22] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques with Java implementations, 2002, SGMD.

[23] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[24] Sebastian Thrun, et al. Text Classification from Labeled and Unlabeled Documents using EM, 2000, Machine Learning.

[25] A. N. Tikhonov, et al. Solutions of ill-posed problems, 1977.

[26] John C. Platt, et al. Fast training of support vector machines using sequential minimal optimization, advances in kernel methods, 1999.

[27] Peter Jackson, et al. A machine learning approach to prior case retrieval, 2001, ICAIL '01.

[28] Dieter Merkl, et al. A learning technique for legal document analysis, 1999, ICAIL '99.

[29] Thomas M. Cover, et al. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing), 2006.

[30] Andreas Rauber, et al. Automatic text representation, classification and labeling in European law, 2001, ICAIL '01.

[31] Kevin D. Ashley, et al. Predicting outcomes of case based legal arguments, 2003, ICAIL.

[32] Renata Vieira, et al. Mining Linguistically Interpreted Texts, 2004.