Digital weight watching: reconstruction of scanned documents

Scanned and OCRed data leads to large file sizes when facsimile images are included. This makes storing large data sets, and providing online access to them, costly. Manually analyzing such data is also cumbersome because of long download and processing times. It may thus be advantageous to reconstruct the scanned documents as documents that contain no scanned images but nevertheless closely resemble the originals. We have carried out this reconstruction for a data set of Dutch parliamentary proceedings, with positive results: the reconstructed documents needed only 1.5% of the original storage space while resembling the originals to a high degree. We describe the reconstruction process and evaluate its costs, benefits, and quality.
