g-DICE: graph mining-based document information content exploitation

In this paper, we present a technique for extracting document information content (i.e. text fields) via graph mining. Real-world users first provide a set of key text fields from the document image that they consider important. These fields are used to initialise a graph in which nodes are labelled with the field names, along with other features such as size, type and number of words, and edges are attributed with the relative positioning between nodes. This attributed relational graph is then used to mine similar graphs from document images, and each graph extracted in this way is used to update the initial graph iteratively, producing a graph model. Graph models can therefore be employed in the absence of users. We have validated the proposed technique and evaluated its scientific impact on a real-world industrial problem, obtaining 86.64 % precision and 90.80 % recall when all zones are considered, viz. header, body and footer. The technique is particularly well suited to table processing (i.e. extracting repeated patterns from the table), where it outperforms the state-of-the-art method by more than 3 %.
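The graph initialisation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Field` class, the bounding-box layout, the three type categories, the pixel tolerance and the relation names are all assumptions made for illustration. Each user-selected field becomes a node carrying its size, type and word count, and each ordered pair of nodes gets an edge describing where the second field lies relative to the first.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Field:
    """A user-selected text field (hypothetical structure)."""
    name: str   # field label supplied by the user
    text: str   # recognised text content
    x: float    # left edge of the bounding box
    y: float    # top edge (y grows downward, as in image coordinates)
    w: float    # box width
    h: float    # box height


def field_type(text):
    """Coarse content type; the three categories are an assumption."""
    compact = text.replace(",", "").replace(".", "").replace(" ", "")
    if compact.isdigit():
        return "numeric"
    if compact.isalpha():
        return "alphabetic"
    return "alphanumeric"


def node_attributes(f):
    """Node features named in the abstract: size, type, number of words."""
    return {
        "size": (f.w, f.h),
        "type": field_type(f.text),
        "n_words": len(f.text.split()),
    }


def relative_position(a, b, tol=5.0):
    """Where b lies relative to a, compared on box centres.

    `tol` is an assumed pixel tolerance for treating two centres
    as aligned.
    """
    ax, ay = a.x + a.w / 2, a.y + a.h / 2
    bx, by = b.x + b.w / 2, b.y + b.h / 2
    if bx < ax - tol:
        horiz = "left"
    elif bx > ax + tol:
        horiz = "right"
    else:
        horiz = "same_col"
    if by < ay - tol:
        vert = "above"
    elif by > ay + tol:
        vert = "below"
    else:
        vert = "same_row"
    return (horiz, vert)


def build_graph(fields):
    """Build an attributed relational graph as plain dictionaries."""
    nodes = {f.name: node_attributes(f) for f in fields}
    edges = {
        (a.name, b.name): relative_position(a, b)
        for a, b in permutations(fields, 2)
    }
    return {"nodes": nodes, "edges": edges}
```

Calling `build_graph` on the fields selected from one document yields the initial graph. The mining and iterative update steps are outside the scope of this sketch.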
