Paper information

On Retrieving Legal Files: Shortening Documents and Weeding Out Garbage

This paper describes our participation in the TREC Legal experiments in 2007. We applied novel normalization techniques designed to slightly favor longer documents rather than assuming that all documents should have equal weight. We also developed a new method for reformulating query text when background information is provided with an information request, and we experimented with enhanced OCR error detection to reduce the size of the term list and remove noise from the data. In this article, we discuss the impact of these techniques on the TREC 2007 data sets and show that simple normalization methods significantly outperform cosine normalization in the legal domain.

Scott Kulp | April Kontostathis

[1] Gerard Salton, et al. Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.

[2] W. Bruce Croft, et al. An Association Thesaurus for Information Retrieval, 1994, RIAO.

[3] Stephen E. Robertson, et al. Okapi at TREC-3, 1994, TREC.

[4] Gerard Salton, et al. Length Normalization in Degraded Text Collections, 1995.

[5] Efthimis N. Efthimiadis, et al. A user-centred evaluation of ranking algorithms for interactive query expansion, 1993, SIGIR.

[6] Scott Kulp. Improving Search and Retrieval Performance through Shortening Documents, Detecting Garbage, and Throwing Out Jargon, 2007.

[7] Chris Buckley, et al. Pivoted Document Length Normalization, 1996, SIGIR Forum.

[8] Susan Gauch, et al. Search improvement via automatic query reformulation, 1991, TOIS.

[9] Kazem Taghva, et al. Automatic Removal of “Garbage Strings” in OCR Text: An Implementation.