Using Linguistic Information and Machine Learning Techniques to Identify Entities from Juridical Documents

Information extraction from legal documents is an important open problem. This paper describes a mixed approach that combines linguistic information with machine learning techniques. Top-level legal concepts are identified and used to classify documents with Support Vector Machines. Named entities, such as locations, organizations, dates, and document references, are identified using semantic information from the output of a natural language parser. Together, the legal concepts and named entities can be used to populate a simple ontology, enriching the documents and supporting the creation of high-level legal information retrieval systems. The proposed methodology was applied to, and evaluated on, a corpus of legal documents from the EUR-Lex site. The results were good and suggest that this is a promising approach to the legal information extraction problem.
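The ontology-population step can be illustrated with a minimal sketch. It assumes the classifier and the named-entity step have already run; the class SimpleOntology, the function populate, the predicate names (hasConcept, mentions, isA), and all document identifiers and values are hypothetical placeholders chosen for illustration, not the ontology or data used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleOntology:
    """Toy triple store holding (subject, predicate, object) facts."""
    triples: set = field(default_factory=set)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def subjects(self, predicate, obj):
        """Return the sorted subjects that have the given predicate and object."""
        return sorted(s for s, p, o in self.triples if p == predicate and o == obj)


def populate(ontology, doc_id, concepts, entities):
    """Attach legal concepts and named entities to one document.

    concepts: concept labels (assumed output of the document classifier).
    entities: (entity_type, text) pairs (assumed output of the entity step).
    """
    for concept in concepts:
        ontology.add(doc_id, "hasConcept", concept)
    for entity_type, text in entities:
        ontology.add(doc_id, "mentions", text)
        ontology.add(text, "isA", entity_type)


# Made-up example inputs; not taken from the EUR-Lex corpus.
onto = SimpleOntology()
populate(onto, "doc-1", ["agriculture"],
         [("Location", "Portugal"), ("Date", "2005-06-01")])
populate(onto, "doc-2", ["agriculture", "fisheries"],
         [("Organization", "European Commission")])

# Retrieve every document classified under a given legal concept.
print(onto.subjects("hasConcept", "agriculture"))  # ['doc-1', 'doc-2']
```

Storing both kinds of information as triples keeps concept-based and entity-based retrieval behind a single query interface, which is one plausible reading of how enriched documents could feed a retrieval system.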
