FOL Learning for Knowledge Discovery in Documents

This chapter proposes applying machine learning techniques that use first-order logic as their representation language to the real-world domain of document processing. First, it presents the tasks and problems involved in document processing, along with the prototypical system DOMINUS and its architecture, whose components are designed to address these issues. It then takes a closer look at the learning component of the system and at the two sub-systems in charge of supervised and unsupervised learning, which support the system's performance. Finally, it reports experiments that assess the quality of the learning performance. The aim is to show researchers and practitioners in the field that first-order logic learning is a viable way to tackle the complexity of the domain and to solve problems such as the incremental evolution of the document repository.
