Machine Learning for Intelligent Processing of Printed Documents

A paper document processing system is an information system component that transforms the information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing, this information capture process relies on knowledge of the specific layout and logical structures of the documents. This article proposes applying machine learning techniques to acquire the knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents such as letters and journals. Knowledge is represented by decision trees and first-order rules, both generated automatically from a set of training documents. In particular, an incremental decision tree learning system is used to acquire the decision trees that classify segmented blocks, while a first-order learning system is used to induce the rules for layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the proposed solutions are evaluated empirically on a set of real printed documents.
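As a rough illustration of the block-classification idea, the sketch below shows how a single decision tree node might choose a cut point on one numeric feature by information gain. It is not the WISDOM++ implementation and is not incremental; the feature (block height in pixels), the class labels, the data, and the function names `entropy` and `best_threshold` are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split
    `value <= threshold` on one numeric feature.

    Returns (None, 0.0) when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point exists between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Weighted average entropy of the two child nodes.
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            # Place the cut midway between the two neighbouring values.
            best = ((lo + hi) / 2, gain)
    return best


# Hypothetical toy data: block heights in pixels with made-up labels.
heights = [12, 14, 15, 90, 120, 200]
kinds = ["text", "text", "text", "picture", "picture", "picture"]
threshold, gain = best_threshold(heights, kinds)
```

On this toy data the two classes are separable on height alone, so the chosen cut falls midway between the tallest "text" block and the shortest "picture" block. A real block classifier would combine several such features and, in the incremental setting the article describes, revise the tree as new training blocks arrive.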
