Cross Domain Assessment of Document to HTML Conversion Tools to Quantify Text and Structural Loss during Document Analysis

Automation is key in forensic text analysis when working with large quantities of documents. Because documents come in a wide variety of file types, tailored tools must be developed for each document type to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that they cannot read, leaving drastic inconsistencies in the forensic text analysis process. As a solution, a single output format, HTML, was chosen as a unified analysis format. Document-to-HTML/CSS extraction tools, each using different techniques to convert common document formats to rich HTML/CSS counterparts, were tested. This approach can reduce the number of analysis tools needed during forensic text analysis by relying on a single document format. Two tests were designed, a 10-point document overview test and a 48-point detailed document analysis test, to assess and quantify the level of loss, the rate of error, and the overall quality of the output HTML structures. The study concluded that tools that combine several approaches and have an understanding of the document structure yield the best results with the least loss.
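
To make the idea of quantifying text loss concrete, the following is a minimal sketch of a word-level loss measure between a known reference text and a converter's HTML output. It is not the study's 10-point or 48-point test, whose criteria are not given here. The names `VisibleTextExtractor`, `html_to_words`, and `text_loss` are hypothetical, and the choice of an order-preserving word match is an assumption made for illustration only.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_words(html):
    """Return the visible text of an HTML string as a list of words."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


def text_loss(reference_text, html):
    """Fraction of reference words with no in-order match in the HTML output.

    0.0 means every reference word was found; 1.0 means none were.
    This is an illustrative metric, not the one used in the study.
    """
    ref_words = reference_text.split()
    if not ref_words:
        return 0.0
    out_words = html_to_words(html)
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(ref_words)
```

A measure of this kind covers only text loss. Structural loss (headings, lists, tables) would need a separate comparison of the HTML element tree, which this sketch does not attempt.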
