Recognizing records from the extracted cells of microfilm tables

Microfilm documents contain a wealth of information, but extracting and organizing this information by hand is slow, error-prone, and tedious. As an initial step toward automating access to this information, this paper describes an algorithm that automatically identifies record patterns in microfilm tables for pre-specified application domains. The table-processing algorithm accepts an XML input file describing the individual cells of a table taken from a microfilm document and determines, for each record in the document, the cells that together make up that record. Two key features drive the algorithm: (1) geometric layout and (2) label matching with respect to a given domain-specific application ontology. The algorithm achieved an accuracy of 92% on our test corpus of genealogical microfilm tables.
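
The two features named above can be illustrated with a minimal sketch. This is not the paper's algorithm: the XML schema (`<cell>` elements with `x`, `y`, `w`, `h` attributes), the `ONTOLOGY` dictionary, the 50% overlap threshold, and every function name are assumptions made for illustration, and the first row is simply assumed to hold the column labels. Cells are grouped into rows by vertical overlap (geometric layout), and header text is looked up in a small label set per concept (label matching).

```python
import xml.etree.ElementTree as ET

# Hypothetical input format: one <cell> per table cell, with a bounding box.
SAMPLE_XML = """
<table>
  <cell x="0" y="0" w="100" h="20">Name</cell>
  <cell x="100" y="0" w="60" h="20">Born</cell>
  <cell x="0" y="20" w="100" h="20">John Smith</cell>
  <cell x="100" y="22" w="60" h="20">1851</cell>
  <cell x="0" y="40" w="100" h="20">Mary Jones</cell>
  <cell x="100" y="41" w="60" h="20">1849</cell>
</table>
"""

# Hypothetical stand-in for an application ontology: concept -> accepted labels.
ONTOLOGY = {
    "Person.Name": {"name", "full name"},
    "Person.BirthYear": {"born", "birth year"},
}


def parse_cells(xml_text):
    """Read every <cell> element into a dict with its box and text."""
    cells = []
    for el in ET.fromstring(xml_text).iter("cell"):
        cell = {k: float(el.get(k)) for k in ("x", "y", "w", "h")}
        cell["text"] = (el.text or "").strip()
        cells.append(cell)
    return cells


def overlap(a, b, pos, size):
    """Length of the overlap of two cells along one axis (may be <= 0)."""
    return min(a[pos] + a[size], b[pos] + b[size]) - max(a[pos], b[pos])


def group_rows(cells):
    """Geometric layout: put cells in the same row when their vertical
    extents overlap by more than half of the smaller cell height."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c["y"], c["x"])):
        for row in rows:
            ref = row[0]
            if overlap(ref, cell, "y", "h") > 0.5 * min(ref["h"], cell["h"]):
                row.append(cell)
                break
        else:
            rows.append([cell])
    for row in rows:
        row.sort(key=lambda c: c["x"])
    return rows


def match_label(text):
    """Label matching: return the ontology concept for a header, or None."""
    key = text.strip().lower()
    for concept, labels in ONTOLOGY.items():
        if key in labels:
            return concept
    return None


def extract_records(xml_text):
    """Assume the first row holds labels; build one record per later row."""
    rows = group_rows(parse_cells(xml_text))
    if not rows:
        return []
    columns = [(h, match_label(h["text"])) for h in rows[0]]
    records = []
    for row in rows[1:]:
        record = {}
        for cell in row:
            # Assign the cell to the header it overlaps most horizontally.
            header, concept = max(
                columns, key=lambda hc: overlap(hc[0], cell, "x", "w")
            )
            if concept is not None and overlap(header, cell, "x", "w") > 0:
                record[concept] = cell["text"]
        records.append(record)
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_XML):
        print(record)
```

On the sample input, the sketch yields two records, each mapping `Person.Name` and `Person.BirthYear` to the text of the cells in one row. Real microfilm tables would need far more than this (skewed scans, multi-line records, nested or missing headers), which is where a richer ontology and layout model come in.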
