Paper Information - A PDF Text Extractor Based on PDF-Renderer

In this paper we propose a new solution for PDF (Portable Document Format) text extraction. First, we compare several PDF text extraction tools, beginning with a brief presentation of available tools that have been used in earlier research. Second, we analyze the performance of ICEpdf and PDFBox, two open-source Java tools. Our experimental results show that neither tool strictly subsumes the other, and that both have clear font and text-overlapping problems. To overcome these issues, we propose a new text extraction engine based on Java PDF-Renderer, which shows good rendering compared with the two tools above. Our results can help researchers who need such a tool to understand the characteristics of each one and to choose a suitable tool for their work.

Aqeel ur Rehman | Moulay Abderrahim
