Paper Information - Recognition and Classification of Figures in PDF Documents

Recognition and Classification of Figures in PDF Documents

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper focuses on graphics recognition of figures in vector-based PDF documents. The first stage consists of extracting the graphic and text primitives corresponding to figures. An interpreter was constructed to translate PDF content into a set of self-contained graphics and text objects (in Java), freed from the intricacies of the PDF file. The second stage consists of discovering simple graphics entities which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures using grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) was able to achieve 100% classification accuracy in hold-out-one training/testing using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition.
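The second and third stages described above can be illustrated with a minimal Java sketch. It assumes the graphic primitives have already been extracted as line segments, finds pairs that satisfy simple geometric constraints, and tallies them into per-figure counts that could serve as classifier attributes. The class name `GraphemeSketch`, the two grapheme types (`corner`, `parallelPair`), and the tolerance value are hypothetical choices for illustration, not the 16 grapheme types or the code used in the paper. The classification step (LogitBoost in Weka) is not shown.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: grapheme types, names, and tolerance are assumptions.
class GraphemeSketch {

    static final double TOL = 1e-6; // hypothetical geometric tolerance

    /** A straight-line primitive, as might be produced by a PDF interpreter. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        double dx() { return x2 - x1; }
        double dy() { return y2 - y1; }
        double length() { return Math.hypot(dx(), dy()); }
    }

    /** True if two points coincide within TOL. */
    static boolean near(double ax, double ay, double bx, double by) {
        return Math.hypot(ax - bx, ay - by) <= TOL;
    }

    /** True if the segments have at least one endpoint in common. */
    static boolean shareEndpoint(Segment a, Segment b) {
        return near(a.x1, a.y1, b.x1, b.y1) || near(a.x1, a.y1, b.x2, b.y2)
            || near(a.x2, a.y2, b.x1, b.y1) || near(a.x2, a.y2, b.x2, b.y2);
    }

    /** True if the direction vectors have a (normalized) dot product near zero. */
    static boolean perpendicular(Segment a, Segment b) {
        double dot = a.dx() * b.dx() + a.dy() * b.dy();
        return Math.abs(dot) <= TOL * a.length() * b.length();
    }

    /** True if the direction vectors have a (normalized) cross product near zero. */
    static boolean parallel(Segment a, Segment b) {
        double cross = a.dx() * b.dy() - a.dy() * b.dx();
        return Math.abs(cross) <= TOL * a.length() * b.length();
    }

    /** Counts grapheme occurrences over all unordered pairs of segments. */
    static Map<String, Integer> countGraphemes(List<Segment> segments) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("corner", 0);
        counts.put("parallelPair", 0);
        for (int i = 0; i < segments.size(); i++) {
            for (int j = i + 1; j < segments.size(); j++) {
                Segment a = segments.get(i);
                Segment b = segments.get(j);
                if (a.length() == 0 || b.length() == 0) {
                    continue; // skip degenerate segments with no direction
                }
                boolean touching = shareEndpoint(a, b);
                if (touching && perpendicular(a, b)) {
                    counts.merge("corner", 1, Integer::sum);
                } else if (!touching && parallel(a, b)) {
                    counts.merge("parallelPair", 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A 10 x 5 rectangle drawn as four separate line primitives.
        List<Segment> rectangle = Arrays.asList(
            new Segment(0, 0, 10, 0),
            new Segment(10, 0, 10, 5),
            new Segment(10, 5, 0, 5),
            new Segment(0, 5, 0, 0));
        System.out.println(countGraphemes(rectangle));
    }
}
```

Each figure would yield one such map of counts. Under the assumptions above, those counts (possibly normalized by the number of primitives) would form the attribute vector passed to a learner. The pairwise loop is quadratic in the number of primitives, which is likely acceptable for single figures but would need spatial indexing for very dense drawings.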

Robert P. Futrelle | Mingyan Shao
