Data Extraction Using NLP Techniques and Its Transformation to Linked Data

We present a system that extracts a knowledge base from raw unstructured text. The knowledge base is designed as a set of entities and their relations and is represented in an ontological framework. The extraction pipeline processes input texts with linguistically aware tools and extracts entities and relations from their syntactic representation. The extracted data is then represented according to the Linked Data principles. The system is designed to be both domain independent and language independent, and it provides users with data that supports more intelligent search than full-text search. We present our first case study on processing Czech legal texts.
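
As an illustration of the final step only, the following minimal sketch shows how extracted entity-relation triples could be serialized as Linked Data in the N-Triples format. It is not the system's implementation: the `BASE` namespace, the helper names `to_uri` and `to_ntriples`, and the sample triple are all hypothetical and chosen purely for illustration.

```python
from urllib.parse import quote

# Hypothetical namespace for minted URIs; not taken from the paper.
BASE = "http://example.org/"


def to_uri(kind, label):
    """Mint a URI reference for an entity or relation label under BASE."""
    # Lowercase, replace spaces with hyphens, percent-encode the rest.
    slug = quote(label.strip().lower().replace(" ", "-"), safe="-")
    return f"<{BASE}{kind}/{slug}>"


def to_ntriples(relations):
    """Serialize (subject, relation, object) label triples as N-Triples lines."""
    lines = []
    for subj, rel, obj in relations:
        lines.append(
            f"{to_uri('entity', subj)} {to_uri('relation', rel)} {to_uri('entity', obj)} ."
        )
    return lines


if __name__ == "__main__":
    # Invented example triple standing in for the output of the extraction step.
    extracted = [("lessee", "pays", "rent")]
    for line in to_ntriples(extracted):
        print(line)
```

Each output line is one RDF statement whose subject, predicate, and object are URIs, so the result can be loaded into a triple store and queried by entity or relation rather than by keyword.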
