Disentangling the Structure of Tables in Scientific Literature

Within the scientific literature, tables are commonly used to present factual and statistical information in a compact form that is easy for readers to digest. The ability to “understand” the structure of tables is key for information extraction in many domains. However, the complexity and variety of presentation layouts and value formats make it difficult to automatically extract the roles and relationships of table cells. In this paper, we present a model that structures tables in a machine-readable way and a methodology to automatically disentangle tables and transform them into the modelled data structure. The method was tested in the domain of clinical trials: it achieved an F-score of 94.26% for cell function identification and 94.84% for identification of inter-cell relationships.
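To make the idea of a machine-readable table model more concrete, the following is a minimal illustrative sketch, not the model described in the paper. The role labels in `CellFunction`, the `Cell` fields, and the `f_score` helper are all hypothetical names introduced here; the abstract does not enumerate the cell functions or specify which F-score variant was used, so the balanced F1 is assumed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class CellFunction(Enum):
    # Hypothetical role labels; the abstract does not list the actual ones.
    HEADER = "header"
    STUB = "stub"
    DATA = "data"


@dataclass
class Cell:
    row: int
    col: int
    text: str
    function: CellFunction
    # Inter-cell relationships, stored as (row, col) positions of the
    # cells that give this cell its meaning (an assumed representation).
    related_to: List[Tuple[int, int]] = field(default_factory=list)


def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# A two-by-two toy table: one column header, one row stub, one data cell.
header = Cell(0, 1, "Weight (kg)", CellFunction.HEADER)
stub = Cell(1, 0, "Placebo group", CellFunction.STUB)
value = Cell(1, 1, "72.4", CellFunction.DATA, related_to=[(0, 1), (1, 0)])
```

In this sketch, a data cell is interpreted by following `related_to` back to its header and stub, and `f_score` could be applied separately to predicted cell functions and predicted relationships.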
