Searching Corrupted Document Collections

Historical documents are typically digitized using Optical Character Recognition (OCR). While effective, OCR output is not always accurate and depends heavily on the quality of the input. Consequently, the text extracted from degraded documents is often corrupted. Our focus is on finding flexible, reliable methods to compensate for such degradation when resources are limited. We extend Segments, our retrieval system based on substring and context fusion, to consider metadata. By extracting topics from documents, and by supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state of the art in all but one test configuration. Our mean reciprocal rank, measured on two free, publicly available, independently judged datasets, is 0.7657 and 0.5382.
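For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean reciprocal rank is commonly computed. It assumes each query has exactly one relevant document; the function names, document identifiers, and sample rankings are hypothetical illustrations and are not taken from the Segments system or its datasets.

```python
from typing import Hashable, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], target: Hashable) -> float:
    """Return 1/rank of the target in the ranked list, or 0.0 if it is absent."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Tuple[Sequence[Hashable], Hashable]]
) -> float:
    """Average the reciprocal rank over (ranked list, target document) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, target) for ranked, target in runs) / len(runs)


# Hypothetical example: three queries, each with one assumed relevant document.
runs = [
    (["d3", "d1", "d7"], "d1"),  # target at rank 2, contributes 1/2
    (["d2", "d9"], "d2"),        # target at rank 1, contributes 1
    (["d4", "d5"], "d8"),        # target not retrieved, contributes 0
]
mrr = mean_reciprocal_rank(runs)
```

Under this definition, a score of 1.0 would mean the target document is always ranked first, so higher values indicate that the sought document tends to appear nearer the top of the result list.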
