The HOLJ Corpus: Supporting Summarisation of Legal Texts

We describe an XML-encoded corpus of texts in the legal domain, gathered for an automatic summarisation project. The corpus carries two distinct layers of annotation: a manual annotation of the rhetorical status of sentences, and an entirely automatic annotation process that combines a number of individual linguistic processors. The manual rhetorical status annotation was developed as training and testing material for a summarisation system based on the work of Teufel and Moens, while the automatic layer encodes linguistic information as features for a machine learning approach to rhetorical status classification.

[1] Hervé Déjean, et al. Introduction to the CoNLL-2001 shared task: clause identification, 2001, CoNLL.

[2] Francine Chen, et al. A trainable document summarizer, 1995, SIGIR '95.

[3] Stefan Evert, et al. The NITE XML Toolkit: Flexible annotation for multimodal language data, 2003, Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc.

[4] Claire Grover, et al. A Rhetorical Status Classifier for Legal Text Summarisation, 2004.

[5] Kathleen McKeown, et al. The decomposition of human-written summary sentences, 1999, SIGIR '99.

[6] John A. Carroll, et al. Robust, applied morphological generation, 2000, INLG.

[7] Simone Teufel, et al. Sentence extraction as a classification task, 1997.

[8] Marc Moens, et al. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status, 2002, CL.

[9] Marc Moens, et al. Argumentative Classification of Extracted Sentences as a First Step Towards Flexible Abstracting, 1999.

[10] Gareth W. Dunleavy, et al. The Language of the Law, 1963.

[11] Claire Grover, et al. Automatic summarisation of legal documents, 2003, ICAIL.

[12] Benjamin C. Hachey, et al. Recognising Clauses Using Symbolic and Machine Learning Approaches, 2002.

[13] William C. Mann, et al. Rhetorical Structure Theory: Description and Construction of Text Structures, 1987.

[14] Inderjeet Mani, et al. Machine Learning of Generic and User-Focused Summarization, 1998, AAAI/IAAI.

[15] James R. Curran, et al. Language Independent NER using a Maximum Entropy Tagger, 2003, CoNLL.

[16] G. Myers. ‘In this paper we report …’: Speech acts and scientific facts, 1992.

[17] Daniel Marcu, et al. The automatic construction of large-scale corpora for summarization research, 1999, SIGIR '99.

[18] Marc Moens, et al. LT TTT - A Flexible Tokenisation Tool, 2000, LREC.

[19] John M. Swales, et al. Genre Analysis: English in Academic and Research Settings, 1993.

[20] Klaus Krippendorff, et al. Content Analysis: An Introduction to Its Methodology, 1980.

[21] Andrei Mikheev, et al. Automatic Rule Induction for Unknown-Word Guessing, 1997, CL.

[22] Michele Banko, et al. Generating Extraction-Based Summaries from Hand-Written Summaries by Aligning Text Spans, 1999.