Automatic Document Organization Exploiting FOL Similarity-based Techniques

Organizing a document collection into meaningful groups is a fundamental issue in document management systems. The grouping can be carried out by comparing the layout structures of the documents. This requires a representation language powerful enough to describe the relations among all the components of a document. First-Order Logic (FOL) formulae provide such a formalism because they express relations directly; however, relations cause serious computational problems due to indeterminacy, since the components of one description can be mapped onto those of another in many alternative ways. A mechanism for comparing the resulting descriptions must also be provided. This paper proposes the use of a novel similarity formula and evaluation criteria to automatically group the documents in a collection according to their layout structure. The approach identifies the description components that are most similar, and hence most likely to correspond to each other, based only on their syntactic structure. Experiments on a real-world dataset confirm the effectiveness of the proposal.
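
To make the idea of matching description components by syntactic structure concrete, the following is a minimal sketch and not the similarity formula proposed in the paper, which the abstract does not state. It assumes each layout description is a list of ground atoms written as tuples, and it substitutes a plain Jaccard index plus a greedy one-to-one matching for the paper's formula and evaluation criteria. The function names, the predicate names (`frame`, `text`, `image`, `on_top`, `to_right`), and the `threshold` parameter are all hypothetical.

```python
from itertools import product


def object_features(atoms):
    """Map each constant to the syntactic roles it plays.

    A role is (predicate, arity, argument position); no domain knowledge is used.
    """
    feats = {}
    for pred, *args in atoms:
        for pos, obj in enumerate(args):
            feats.setdefault(obj, set()).add((pred, len(args), pos))
    return feats


def jaccard(a, b):
    """Jaccard index of two sets (stand-in for the paper's similarity formula)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def description_similarity(atoms_a, atoms_b):
    """Greedily pair the most similar objects and average their scores."""
    fa, fb = object_features(atoms_a), object_features(atoms_b)
    if not fa or not fb:
        return 0.0
    # Score every candidate pairing, best first (stable sort keeps ties in order).
    pairs = sorted(
        ((jaccard(fa[x], fb[y]), x, y) for x, y in product(fa, fb)),
        key=lambda t: -t[0],
    )
    used_a, used_b, total = set(), set(), 0.0
    for score, x, y in pairs:
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            total += score
    # Unmatched objects count as zero similarity.
    return total / max(len(fa), len(fb))


def group_documents(docs, threshold=0.6):
    """Put a document in the first group holding a sufficiently similar member."""
    groups = []
    for name, atoms in docs.items():
        for group in groups:
            if any(description_similarity(atoms, docs[o]) >= threshold for o in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical toy layout descriptions (not taken from the paper's dataset).
docs = {
    "d1": [("frame", "a1"), ("frame", "a2"), ("text", "a1"),
           ("image", "a2"), ("on_top", "a1", "a2")],
    "d2": [("frame", "b1"), ("frame", "b2"), ("text", "b1"),
           ("image", "b2"), ("on_top", "b1", "b2")],
    "d3": [("frame", "c1"), ("frame", "c2"), ("text", "c1"),
           ("text", "c2"), ("to_right", "c1", "c2")],
}
groups = group_documents(docs)
```

The greedy matching is one simple way to sidestep indeterminacy: instead of enumerating every possible mapping between the objects of two descriptions, it commits to the highest-scoring pairs first. In this toy input, `d1` and `d2` differ only in their constant names and so end up in the same group, while `d3` forms its own.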
