PDF2XML: Converting PDF to XML

XML is a markup language for documents containing structured information. It is designed to make it easy to interchange structured documents over the Internet and to integrate them with database management systems. PDF is a document format intended to reproduce the look of a page electronically. There is strong demand for converting existing PDF documents into XML so that they become searchable and manageable. Because PDF is essentially a page layout format and does not carry the original document structure, converting PDF to XML remains a challenging task. This paper addresses the related technical problems and explores approaches to solving them. As part of the Data Conversion Project under development at the Data Conversion Center, funded by DoD, we present PDF2XML, a system designed to perform the conversion automatically with minimal human interaction.
