Uncertainty and imprecision management in a knowledge extraction process from unstructured texts (Gestion de l'incertitude et de l'imprécision dans un processus d'extraction de connaissances à partir des textes)

Knowledge discovery and inference are addressed in different ways in the scientific literature, across many domains such as information retrieval, textual inference and knowledge base population. These concepts are attracting growing interest in both academia and industry, which encourages the development of new methods. This manuscript proposes an automated approach to infer and evaluate knowledge from relations extracted from unstructured texts. Its originality lies in a novel framework that exploits (i) linguistic uncertainty, by means of an uncertainty detection method described in this manuscript; (ii) a generated partial ordering of the studied objects (e.g. noun phrases) that takes into account syntactic implications and a priori knowledge defined in taxonomies; and (iii) an evaluation step in which selection models assess extracted and inferred relations by exploiting a specific partial ordering of relations. This partial ordering makes it possible to compute criteria, using information propagation rules, in order to evaluate the belief associated with a relation while taking linguistic uncertainty into account. The proposed approach is illustrated and evaluated through the definition of a system that performs question answering by analysing texts available on the Web. This case study shows the benefits of structuring the processed information (e.g. using prior knowledge), the impact of selection models, and the role of linguistic uncertainty in inferring and discovering new knowledge. These contributions have been validated by several international and national publications, and our pipeline can be downloaded at https://github.com/PAJEAN/.
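
The idea of ordering relations through a taxonomy and propagating support along that order can be illustrated with a small sketch. This is not the pipeline described in the manuscript: the toy taxonomy, the `(subject, predicate, object)` relation format, the certainty weights in [0, 1] and the additive propagation rule are all assumptions chosen only to make the notion concrete.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); not taken from the manuscript.
TAXONOMY = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "migraine": "headache",
    "headache": "symptom",
}


def ancestors_or_self(term):
    """Return the term followed by all of its ancestors in TAXONOMY."""
    chain = [term]
    while term in TAXONOMY:
        term = TAXONOMY[term]
        chain.append(term)
    return chain


def is_more_specific(r1, r2):
    """True if relation r1 is subsumed by r2 (same predicate, and the
    subject and object of r2 are ancestors-or-self of those of r1)."""
    s1, p1, o1 = r1
    s2, p2, o2 = r2
    return (
        p1 == p2
        and s2 in ancestors_or_self(s1)
        and o2 in ancestors_or_self(o1)
    )


def propagate_support(extracted):
    """Accumulate certainty weights upward along the partial order.

    `extracted` is a list of ((subj, pred, obj), weight) pairs. The weight
    is an assumed certainty score: 1.0 for a plain assertion, lower for a
    hedged statement. Each extracted relation supports itself and every
    more general relation that subsumes it (assumed additive rule).
    """
    support = defaultdict(float)
    for (subj, pred, obj), weight in extracted:
        for s in ancestors_or_self(subj):
            for o in ancestors_or_self(obj):
                support[(s, pred, o)] += weight
    return dict(support)


extracted = [
    (("aspirin", "treats", "migraine"), 1.0),    # asserted in the text
    (("ibuprofen", "treats", "headache"), 0.5),  # hedged, e.g. "may treat"
]
support = propagate_support(extracted)
print(support[("drug", "treats", "headache")])
```

In this sketch, the more general relation `("drug", "treats", "headache")` is never stated in the input but receives support from both extracted relations, with the hedged one contributing less. A selection model in the sense of the abstract would then rank or filter candidate relations using such scores; how that is actually done is specified in the manuscript, not here.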
