Recognition and Classification of Figures in PDF Documents

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper focuses instead on graphics recognition of figures in vector-based PDF documents, which proceeds in three stages. The first stage extracts the graphic and text primitives that make up each figure: an interpreter was constructed to translate PDF content into a set of self-contained graphics and text objects (in Java), freed from the intricacies of the PDF file. The second stage discovers simple graphics entities, which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures, with grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) achieved 100% classification accuracy in hold-out-one (leave-one-out) training/testing, using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition.
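
To make the grapheme idea concrete, the following is a minimal Java sketch of the second stage under stated assumptions. It is not the paper's implementation: the class GraphemeSketch, the Segment type, the two grapheme types ("corner" and "parallel pair"), and the tolerance values are all hypothetical, chosen only to illustrate "a pair of primitive graphic objects satisfying certain geometric constraints". The resulting counts stand in for the grapheme statistics that a learner such as LogitBoost would take as attributes.

```java
import java.util.Arrays;
import java.util.List;

public class GraphemeSketch {

    /** Hypothetical line-segment primitive, as an interpreter stage might emit it. */
    public static final class Segment {
        final double x1, y1, x2, y2;
        public Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        double dx() { return x2 - x1; }
        double dy() { return y2 - y1; }
        double length() { return Math.hypot(dx(), dy()); }
    }

    // Assumed tolerances; the paper's actual constraints are not given in the abstract.
    static final double ANGLE_TOL = Math.toRadians(2.0);
    static final double TOUCH_TOL = 0.5;   // distance, in page units
    static final double LENGTH_TOL = 0.1;  // relative length difference

    /** Undirected angle between two segments, in [0, pi/2]. Zero-length input yields NaN. */
    static double angleBetween(Segment a, Segment b) {
        double cos = Math.abs(a.dx() * b.dx() + a.dy() * b.dy()) / (a.length() * b.length());
        return Math.acos(Math.min(1.0, cos));
    }

    static boolean near(double px, double py, double qx, double qy) {
        return Math.hypot(px - qx, py - qy) <= TOUCH_TOL;
    }

    static boolean shareEndpoint(Segment a, Segment b) {
        return near(a.x1, a.y1, b.x1, b.y1) || near(a.x1, a.y1, b.x2, b.y2)
            || near(a.x2, a.y2, b.x1, b.y1) || near(a.x2, a.y2, b.x2, b.y2);
    }

    /** "Corner" grapheme: two segments meeting at an endpoint at roughly a right angle. */
    public static boolean isCorner(Segment a, Segment b) {
        return shareEndpoint(a, b) && Math.abs(angleBetween(a, b) - Math.PI / 2) <= ANGLE_TOL;
    }

    /** "Parallel pair" grapheme: two roughly parallel segments of similar length. */
    public static boolean isParallelPair(Segment a, Segment b) {
        double longer = Math.max(a.length(), b.length());
        return angleBetween(a, b) <= ANGLE_TOL
            && Math.abs(a.length() - b.length()) <= LENGTH_TOL * longer;
    }

    /** Counts graphemes over all unordered pairs; returns {corners, parallelPairs}. */
    public static int[] graphemeCounts(List<Segment> segments) {
        int corners = 0, parallelPairs = 0;
        for (int i = 0; i < segments.size(); i++) {
            for (int j = i + 1; j < segments.size(); j++) {
                if (isCorner(segments.get(i), segments.get(j))) corners++;
                if (isParallelPair(segments.get(i), segments.get(j))) parallelPairs++;
            }
        }
        return new int[] { corners, parallelPairs };
    }

    public static void main(String[] args) {
        // Four sides of a 10 x 5 rectangle, e.g. one bar in a bar chart.
        List<Segment> rectangle = Arrays.asList(
            new Segment(0, 0, 10, 0), new Segment(10, 0, 10, 5),
            new Segment(10, 5, 0, 5), new Segment(0, 5, 0, 0));
        int[] counts = graphemeCounts(rectangle);
        System.out.println("corners=" + counts[0] + " parallelPairs=" + counts[1]);
        // prints: corners=4 parallelPairs=2
    }
}
```

In this sketch a figure is summarized by a short vector of counts, one per grapheme type. A full system along the lines described above would use more grapheme types (the abstract reports 16) and pass one such vector per figure to the classifier.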
