Figures in digital documents contain important information, yet current digital libraries neither summarize nor index the information within figures for document retrieval. We present a system for the automatic categorization of figures and the extraction of data from 2-D plots. A machine-learning-based method categorizes figures into a set of predefined types using image features, and an automated algorithm extracts data values from solid line curves in 2-D plots. The semantic type of each figure and the data values extracted from 2-D plots can be integrated with the textual information in documents to provide more effective retrieval services for digital library users. Experimental evaluation demonstrates that the system can produce results suitable for real-world use.
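To make the data-extraction step concrete, the following is a minimal sketch of how pixel positions on a solid line curve could be mapped back to data values. It is not the algorithm described in this work. It assumes the plot area has already been cropped and binarized into a mask of curve pixels, and that the axis ranges are known. The function name `extract_curve` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_curve(
    mask: Sequence[Sequence[int]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
) -> List[Tuple[float, float]]:
    """Map curve pixels in a binarized plot area to (x, y) data values.

    Assumptions (illustrative, not taken from the source):
    - mask[r][c] is truthy where a curve pixel is present, rows run top to bottom;
    - the mask covers exactly the plot area, so its edges align with the axis ranges;
    - the plot holds a single solid curve with linear axes.
    """
    height = len(mask)
    width = len(mask[0])
    x_min, x_max = x_range
    y_min, y_max = y_range

    points = []
    for col in range(width):
        # Collect every row in this column that holds a curve pixel.
        rows = [r for r in range(height) if mask[r][col]]
        if not rows:
            continue  # no curve pixel in this column
        # Use the centre of the stroke so thick lines yield a single value.
        row = sum(rows) / len(rows)
        # Linear pixel-to-data mapping; image rows grow downward, so y is flipped.
        x = x_min if width == 1 else x_min + (x_max - x_min) * col / (width - 1)
        y = y_max if height == 1 else y_max - (y_max - y_min) * row / (height - 1)
        points.append((x, y))
    return points


# Usage: a 3x3 mask containing a rising diagonal line.
diagonal = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
curve_points = extract_curve(diagonal, x_range=(0.0, 2.0), y_range=(0.0, 2.0))
```

Taking the per-column mean of the stroke rows is one simple way to handle line thickness. A real system would also need to locate the axes, read the tick labels, and separate overlapping curves, none of which this sketch attempts.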