Converting PDF to HTML approach based on text detection

Converting a PDF document to an HTML document with the same layout is an important and interesting research problem. Once converted, the document is easier to browse online and its information is easier to extract. Starting from the extraction results that the open source tool PDFBox produces for a PDF document, the paper describes a method that detects the document's layout information and converts the document to an HTML page effectively.
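
The abstract does not spell out the detection algorithm, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes the extractor has already produced positioned text runs (the `Fragment` type, its fields, the 2-point baseline tolerance, and the A4 page size are all hypothetical choices for illustration), groups those runs into lines by baseline, and emits absolutely positioned HTML that preserves the layout.

```python
from html import escape
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical positioned text run, as an extractor might report it."""
    text: str
    x: float          # left edge, in points
    y: float          # baseline distance from the top of the page, in points
    font_size: float  # in points


def group_into_lines(fragments, tolerance=2.0):
    """Cluster fragments whose baselines lie within `tolerance` points
    of the first fragment of the current line."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Restore left-to-right reading order inside each line.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def to_html(fragments, page_width=595, page_height=842):
    """Emit one absolutely positioned <span> per fragment."""
    parts = [
        f'<div style="position:relative;'
        f'width:{page_width}pt;height:{page_height}pt">'
    ]
    for line in group_into_lines(fragments):
        # Rough top edge: baseline minus font size (an approximation).
        top = line[0].y - line[0].font_size
        for frag in line:
            parts.append(
                f'<span style="position:absolute;left:{frag.x}pt;'
                f'top:{top}pt;font-size:{frag.font_size}pt">'
                f'{escape(frag.text)}</span>'
            )
    parts.append("</div>")
    return "\n".join(parts)
```

Absolute positioning keeps the visual layout but not the logical structure; a fuller converter would presumably also merge lines into paragraphs and detect tables.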
