Validation: A Critical First Step in the Evaluation of Systems for Legal Corpus Determination

The continued growth of very large data environments, both proprietary and Web-based, increases the importance of effective and efficient legal corpus selection and searching. Current “database selection” research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which data sets with relevant documents are not searched (compromised recall). It also merges result sets, often from disparate data sources, some of which users may have discarded before their source selection task was completed (diluted precision). We examine the impact that user interaction can have on the process of legal corpus selection. After analyzing thousands of real user queries, we show that precision can be significantly increased when queries are categorized by the users themselves, then interpreted and treated accurately by the system. As a precursor to evaluation, in this workshop we present three behind-the-scenes system validation exercises to help us determine whether certain system design decisions are justified in the context of our long-term goal of providing a corpus selection tool to legal practitioners. We ultimately show that by avoiding a one-size-fits-all approach that restricts the role users can play in information discovery, legal corpus selection effectiveness can be appreciably improved.
