A word spotting framework for historical machine-printed documents

In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step improves the quality of the document images, and word segmentation is accomplished by two complementary segmentation methodologies. In the proposed methodology, synthetic word images are created from keywords and compared with all the words in the digitized documents. A user feedback process refines the search procedure. The methodology has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries. To improve the efficiency of access and search, natural language processing techniques have been employed: a morphological generator, which enables searching with only a base word-form to locate all the corresponding inflected word-forms, and a synonym dictionary, which further facilitates access to the semantic context of the documents.
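
As a rough illustration of the query-side expansion described above, the following minimal Python sketch expands a base word-form into its inflected forms and synonyms before any image matching takes place. The names `INFLECTIONS`, `SYNONYMS`, and `expand_query` are hypothetical, and the small lookup tables are toy stand-ins for the morphological generator and synonym dictionary; they do not reproduce the paper's actual resources or implementation.

```python
from typing import Dict, List

# Toy stand-in for a morphological generator (assumption: a simple lookup
# table; illustrative entries only, not the paper's rule-based generator).
INFLECTIONS: Dict[str, List[str]] = {
    "λόγος": ["λόγος", "λόγου", "λόγον", "λόγοι", "λόγων", "λόγους"],
}

# Toy stand-in for a synonym dictionary (illustrative entry only).
SYNONYMS: Dict[str, List[str]] = {
    "λόγος": ["ομιλία"],
}


def expand_query(
    base_form: str,
    inflections: Dict[str, List[str]],
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Return the base form, its synonyms, and all known inflected forms.

    Order is preserved and duplicates are dropped. A term with no entry in
    the inflection table is kept unchanged.
    """
    terms = [base_form] + synonyms.get(base_form, [])
    expanded: List[str] = []
    seen = set()
    for term in terms:
        for form in inflections.get(term, [term]):
            if form not in seen:
                seen.add(form)
                expanded.append(form)
    return expanded


if __name__ == "__main__":
    # Each expanded form would then be rendered as a synthetic word image
    # and compared with the segmented word images (not shown here).
    for form in expand_query("λόγος", INFLECTIONS, SYNONYMS):
        print(form)
```

In such a design, the user types a single base form and the system issues one synthetic-image query per expanded form, so the matching stage itself needs no knowledge of morphology.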
