Ontology-based Information Extraction from Technical Documents

This paper presents a novel system for extracting user-relevant tabular information from documents. The system is generic: it can be applied to documents from any domain, regardless of the information they contain. It is also robust to the different layouts used when such documents are created. The system has two main modules: table detection and ontological information extraction. The table detection module extracts all tables from a given technical document, while the ontological information extraction module selects only the relevant tables from those detected. Generality is achieved through ontologies, which let the system adapt to a new set of documents from any other domain according to the ontology provided. For each extracted table, the system also provides a confidence score for its relevance, together with an explanation of that score. The system was evaluated on 80 real technical documents for hardware parts, containing 2033 tables from 20 different brands in the industrial boilers domain. The results show that the system extracted all of the relevant tables and achieved an overall precision, recall, and F-measure of 0.88, 1, and 0.93, respectively.
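
The abstract does not specify how relevance is scored, so the following is only a minimal sketch of the general idea: matching a detected table's header cells against concepts from a supplied ontology, and returning both a confidence score and the matched concepts as an explanation. The ontology layout (a dict mapping a concept to its labels), the names `score_table` and `relevant_tables`, and the 0.5 threshold are all assumptions made for illustration, not the system's actual design.

```python
from dataclasses import dataclass


@dataclass
class Relevance:
    score: float   # fraction of ontology concepts found in the table header
    matched: list  # concepts that were found; serves as the explanation


def score_table(header_cells, ontology_terms):
    """Score one detected table against an ontology (hypothetical scheme).

    ontology_terms maps a concept name to the set of labels or synonyms
    that may appear in a table header.
    """
    header = {cell.strip().lower() for cell in header_cells}
    matched = sorted(
        concept
        for concept, labels in ontology_terms.items()
        if header & {label.lower() for label in labels}
    )
    score = len(matched) / len(ontology_terms) if ontology_terms else 0.0
    return Relevance(score=score, matched=matched)


def relevant_tables(tables, ontology_terms, threshold=0.5):
    """Keep only tables whose score reaches the (assumed) threshold."""
    scored = [(header, score_table(header, ontology_terms)) for header in tables]
    return [(header, rel) for header, rel in scored if rel.score >= threshold]


# Toy ontology for a boiler data sheet; the concepts and labels are invented.
boiler_ontology = {
    "pressure": {"Max pressure", "Operating pressure"},
    "capacity": {"Capacity", "Output"},
}
detected = [
    ["Model", "Capacity", "Max Pressure"],  # technical specification table
    ["Page", "Revision"],                   # document history table
]
kept = relevant_tables(detected, boiler_ontology)
```

Swapping in a different ontology dict changes which tables are kept without touching the code, which is the kind of domain adaptation the abstract describes.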
