Caption-guided patent image segmentation

The paper presents a method for splitting patent drawings into subimages. Image-based patent retrieval and automatic document understanding require the individual subimages that are referenced in the text of a patent document. Our method exploits the fact that each subimage has its own caption inscribed in the compound image. To find the approximate positions of the subimages, the captions are localized first. The subimages are then found using empirical rules concerning the positions of connected components relative to the subimage captions. These rules are based on the common-sense observations that distances between connected components belonging to the same subimage are smaller than distances between connected components belonging to different subimages, and that captions are located close to their corresponding subimages. Alternatively, the image segmentation can be defined as an optimization problem that aims to maximize the gaps between hypothetical subimages while preserving their relations to the corresponding captions. The proposed segmentation method can be treated as an approximate solution to this problem.
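The two empirical rules can be illustrated with a small sketch. It is not the paper's algorithm: it assumes that captions and connected components are already available as bounding boxes, and it swaps in a simple greedy region-growing rule, where each caption seeds a cluster and the closest unassigned component is repeatedly attached to the nearest cluster. The names `box_gap` and `segment_by_captions` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical choices made for this illustration.

```python
from typing import Dict, List, Tuple

# Assumed box layout (not from the paper): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def box_gap(a: Box, b: Box) -> float:
    """Euclidean gap between two axis-aligned boxes (0.0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def segment_by_captions(
    captions: Dict[str, Box], components: List[Box]
) -> Dict[str, List[Box]]:
    """Greedily assign connected components to caption-seeded clusters.

    Each caption starts its own cluster. At every step the unassigned
    component with the smallest gap to any cluster member is attached to
    that cluster, so small intra-subimage gaps are bridged before the
    larger gaps between subimages are ever considered.
    """
    if not captions:
        return {}
    # Cluster members include the caption box itself, which acts as the seed.
    clusters: Dict[str, List[Box]] = {label: [box] for label, box in captions.items()}
    result: Dict[str, List[Box]] = {label: [] for label in captions}
    remaining = list(components)
    while remaining:
        # Pick the (gap, component index, caption label) triple with the smallest gap.
        _, idx, label = min(
            (box_gap(comp, member), idx, label)
            for idx, comp in enumerate(remaining)
            for label, members in clusters.items()
            for member in members
        )
        comp = remaining.pop(idx)
        clusters[label].append(comp)
        result[label].append(comp)
    return result
```

Calling `segment_by_captions({"Fig. 1": (10, 100, 50, 110), "Fig. 2": (210, 100, 250, 110)}, boxes)` returns a mapping from each caption label to the component boxes assigned to it. The greedy rule only approximates the gap-maximizing formulation mentioned above, and its cost grows quickly with the number of components, so it is meant as an illustration rather than a practical implementation.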
