Digital weight watching: reconstruction of scanned documents

We analyze a web portal that provides access to over 250,000 scanned and OCRed cultural heritage documents. The collection consists of the complete Dutch Hansard from 1917 to 1995. Each document consists of facsimile images of the original pages plus hidden OCRed text. Including the images yields large files, of which less than 2% is the actual text. The portal's search user interface provides poor ranking and uninformative document summaries (snippets). Users must therefore weed out non-relevant results themselves, which requires assessing the complete documents. This is time-consuming and frustrating because the large files take long to download and process. Instead of using the scanned images for relevance assessment, we propose to use a reconstruction of the original document from a purely semantic representation. Evaluation on the Dutch dataset shows that these reconstructions are two orders of magnitude smaller than the originals and still resemble them to a high degree. In addition, they are easier to speed-read and evaluate for relevance, due to added hyperlinks and a presentation optimized for reading from a terminal. We describe the reconstruction process and evaluate its costs, benefits, and quality.
