Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents

Physical and digital documents often contain visually rich information. In such documents, no strict ordering or positioning dictates where data values must appear. Along with textual cues, these documents rely on salient visual features to define distinct semantic boundaries and to augment the information they convey. Traditional information extraction (IE) techniques fall short on such documents because they use a text-only representation and ignore the visual cues inherent in the layout. We propose VS2, a generalized approach for information extraction from heterogeneous visually rich documents. This work makes two major contributions. First, we propose a robust segmentation algorithm that decomposes a visually rich document into a bag of visually isolated but semantically coherent areas, called logical blocks. This process uses low-level visual and semantic features that are agnostic to document type. Second, we propose a distantly supervised search-and-select method that identifies the named entities within these documents by using the context boundaries defined by the logical blocks. Experimental results on three heterogeneous datasets suggest that the proposed approach significantly outperforms its text-only counterparts on all datasets. A comparison against state-of-the-art methods also reveals that VS2 performs comparably or better on all datasets.
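To make the notion of a logical block concrete, the following is a minimal sketch of one way to group text elements into visually isolated areas. It is not the VS2 segmentation algorithm: it uses only bounding-box geometry, whereas the approach described above also relies on semantic features. The `Box` type, the `gap` and `group_into_blocks` helpers, and the `max_gap` threshold are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """A text element with its bounding box (hypothetical input format)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a: Box, b: Box) -> float:
    """Largest axis-aligned gap between two boxes; 0 if they overlap."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def group_into_blocks(boxes: List[Box], max_gap: float) -> List[List[Box]]:
    """Merge boxes whose gap is at most max_gap (assumed threshold).

    Uses union-find, so boxes linked through a chain of close
    neighbours end up in the same block.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return list(groups.values())


# Illustrative data only: two nearby lines and one distant line.
sample = [
    Box("Jazz Night", 0, 0, 100, 20),
    Box("Friday 8 PM", 0, 25, 100, 45),
    Box("Contact: 555-0100", 0, 200, 100, 220),
]
blocks = group_into_blocks(sample, max_gap=10)
```

With this sample input, the two closely spaced lines fall into one block and the distant contact line into another. An entity search restricted to each block would then see only the text that is visually grouped together, which is the role the context boundaries play in the search-and-select step.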
