Unleashing Tabular Content to Open Data: A Survey on PDF Table Extraction Methods and Tools

Portable Document Format (PDF) has been a popular way to exchange data in documents since Adobe introduced the format in 1993. Its report-like nature, which preserves and prioritizes visual presentation, addressed one of the main publishing concerns of several sectors, including government agencies. As a result, tabular data came to be enclosed within PDF documents and published on government portals. This situation, apart from being surprisingly contradictory to data openness, is still found even in the major open data initiatives. It is estimated that roughly 13% of the files published in some of the main open data portals around the world make their data available in PDF. Thus, there is a need for effective tools capable of extracting tabular content (a main placeholder for data) from PDF so that the data can be published in more open formats, such as the well-known CSV, which complies with the open data principles of accessibility and machine processability. This paper aims to provide a structured and comprehensive overview of the research on tabular content extraction specifically from PDF documents, as well as an overview of the most recent practical results in the literature. The contribution of this work goes beyond theoretical discussion by helping data practitioners understand to what extent methods and tools for tabular content extraction from PDF can benefit open data initiatives in practical and effective ways.
