NLP Data Cleansing Based on Linguistic Ontology Constraints

Linked Data comprises an unprecedented volume of structured data on the Web and is being adopted by an increasing number of domains. However, the varying quality of published data is a barrier to further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology for Linked Data quality assessment that is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting, and we describe how the method is applied in the Natural Language Processing (NLP) area. Compared to other domains, such as biology, NLP is a late Linked Data adopter. However, it has seen a steep rise in activity in the creation of data and ontologies. Data quality assessment has become an important need for NLP datasets. In our study, we analysed 11 datasets that use the lemon and NIF vocabularies against 277 test cases and point out common quality issues.
