Book search: indexing the valuable parts

With massive book digitization efforts underway, effective book retrieval strategies are needed. This paper examines how much different parts of digitized, OCR'ed books contribute to retrieval effectiveness. The parts examined are the entire content of books, book headings, book titles, and table of contents entries. Results show that indexing only the headings and titles of books is nearly as effective as indexing their entire contents, which indicates that titles and headings are more valuable than other parts of a book. This is akin to web search, where hypertext and page titles are more valuable to index than the rest of the webpage. Combining evidence from multiple parts of a book improves retrieval effectiveness further, compared to using any single part in isolation.
