Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop

Ontologies are dynamic artifacts that evolve in both structure and content. Keeping them up to date is an expensive but critical operation for any application that relies on Semantic Web technologies. In this paper we focus on evolving the content of an ontology by extracting relevant instances of ontological concepts from text. We propose a novel technique which (i) is completely language independent, (ii) combines statistical methods with a human-in-the-loop, and (iii) exploits Linked Data as a bootstrapping source. Our experiments on a publicly available medical corpus and on a Twitter dataset show that the proposed solution achieves comparable performance regardless of the language, domain, and style of the text. Because the method relies on a human-in-the-loop, its results can be safely fed directly back into Linked Data resources.
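The combination of a Linked Data seed, a statistical ranking step, and human validation can be illustrated with a minimal sketch. This is not the paper's implementation: the context-window cosine score, the function names (`context_profile`, `rank_candidates`, `expand`), the toy corpus, and the `oracle` callback standing in for the human reviewer are all assumptions made for illustration. The seed set is hard-coded here, where a real system would take it from labels in a Linked Data resource.

```python
from collections import Counter


def context_profile(tokens, term, window=2):
    """Count the words that occur within `window` positions of `term`."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            profile.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(tokens, dictionary, window=2):
    """Rank non-dictionary tokens by how closely their contexts match the dictionary's."""
    seed_profile = Counter()
    for term in dictionary:
        seed_profile.update(context_profile(tokens, term, window))
    candidates = set(tokens) - set(dictionary)
    scored = [(cosine(context_profile(tokens, c, window), seed_profile), c)
              for c in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [c for score, c in scored if score > 0]


def expand(tokens, seeds, oracle, rounds=3, top_k=5):
    """Grow the dictionary; `oracle` plays the role of the human reviewer."""
    dictionary, rejected = set(seeds), set()
    for _ in range(rounds):
        ranked = rank_candidates(tokens, dictionary)
        proposals = [c for c in ranked if c not in rejected][:top_k]
        accepted = {c for c in proposals if oracle(c)}
        rejected.update(set(proposals) - accepted)
        if not accepted:  # nothing new was validated, so stop early
            break
        dictionary |= accepted
    return dictionary


# Toy corpus; whitespace tokenization keeps the sketch language agnostic.
corpus = ("patient took aspirin daily . patient took ibuprofen daily . "
          "patient took warfarin daily . doctor saw patient today").split()
seeds = {"aspirin"}  # assumed to come from a Linked Data resource
drugs = {"ibuprofen", "warfarin"}  # simulated human judgments
result = expand(corpus, seeds, oracle=lambda term: term in drugs)
print(sorted(result))
```

Because every accepted term passes through the reviewer, the dictionary only ever contains validated instances, which is the property that makes writing results back to a shared resource reasonable.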
