Analysis of Documents Born Digital

While traditional document analysis has focused on printed media, an increasingly large portion of today's documents are generated in digital form from the start. Such “documents born digital” range from plain text documents such as emails to more sophisticated forms such as PDF documents and Web documents. On the one hand, the existence of a digital encoding eliminates the need for scanning, image processing, and character recognition in most situations (a notable exception being the prevalent use of text embedded in images in Web documents, as elaborated upon in section “Analysis of Text in Web Images”). On the other hand, many higher-level processing tasks remain, because almost all existing digital document encoding systems (e.g., HTML, PDF) are designed for display or printing for human consumption, not for machine-level information exchange and extraction. As a result, a significant amount of processing is still required for automatic information extraction, indexing, and content repurposing from such documents, and many challenges remain in this process. This chapter describes in detail the key technologies for processing documents born digital, with a focus on PDF and Web document processing.
