Transforming paper documents into XML format with WISDOM++

Abstract. Transforming scanned paper documents into a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to parts of the document image is only one of them. In fact, generating documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. Adopting an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. Benchmark results for the system components implementing these innovative aspects are reported.
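To make the final step more concrete, the following is a minimal sketch of how layout blocks that already carry a logical label and recognized text might be serialized to XML. It is not the WISDOM++ implementation: the `Block` type, the `blocks_to_xml` function, the element and attribute names, and the top-to-bottom reading order are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout block produced by the earlier processing steps."""
    label: str   # logical label from document understanding, e.g. "title"
    x: int       # left edge of the block in pixels
    y: int       # top edge of the block in pixels
    text: str    # OCR output for the block (empty for non-text blocks)


def blocks_to_xml(doc_class: str, blocks: list[Block]) -> str:
    """Serialize labeled blocks into a small XML document.

    Assumption: reading order is approximated by sorting blocks
    top-to-bottom and then left-to-right.
    """
    root = ET.Element("document", {"class": doc_class})
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        element = ET.SubElement(
            root, block.label, {"x": str(block.x), "y": str(block.y)}
        )
        element.text = block.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = [
        Block("abstract", 40, 300, "The transformation of scanned paper..."),
        Block("title", 40, 50, "Transforming paper documents into XML"),
        Block("author", 40, 120, "Example Author"),
    ]
    print(blocks_to_xml("scientific_paper", page))
```

Because the logical labels become element names, a query for, say, all `title` elements can be answered directly from the XML, which is the retrieval benefit the abstract attributes to XML over plain HTML.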
