Fouille de motifs et CRF pour la reconnaissance de symptômes dans les textes biomédicaux (Pattern mining and CRF for symptoms recognition in biomedical texts) [In French]

In this article, we address the extraction of symptom-type medical entities from biomedical texts. This task has received little attention in the literature and, to our knowledge, no annotated corpus exists for training a learning model. We propose two weakly supervised approaches to extract these entities. The first is based on pattern mining and introduces a new semantic similarity constraint. The second formulates the task as a sequence labeling problem using CRFs (conditional random fields). We describe the experiments we conducted, which show that the two approaches are complementary in terms of quantitative evaluation (recall and precision). We further show that combining them noticeably improves the results.
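
To make the sequence labeling formulation concrete, here is a minimal sketch of how symptom mentions can be encoded as per-token labels, the form a CRF tagger is trained on. The abstract does not specify a label scheme, so the BIO scheme, the `SYMPTOM` label name, the helper names `spans_to_bio` and `bio_to_spans`, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int]],
                 label: str = "SYMPTOM") -> List[str]:
    """Encode entity spans as BIO tags.

    Spans are (start, end) token indices with end exclusive.
    The BIO scheme and the label name are assumptions, not the paper's setup.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span: {(start, end)}")
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Decode BIO tags back into (start, end) spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:           # close the previous mention
                spans.append((start, i))
            start = i
        elif tag.startswith("I-") and start is not None:
            continue                        # still inside the current mention
        else:
            if start is not None:
                spans.append((start, i))
            start = None                    # "O" or a stray "I-" tag
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical example sentence with two symptom mentions.
tokens = ["The", "patient", "reports", "chronic", "abdominal", "pain",
          "and", "fever", "."]
tags = spans_to_bio(tokens, [(3, 6), (7, 8)])
```

A CRF would then be trained to predict one such tag per token from token-level features. In a weakly supervised setting, the training spans would come from automatic annotation rather than manual labeling.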
