Corpus annoté de cas cliniques en français (Annotated corpus with clinical cases in French)

Textual corpora are important for many NLP tasks because they provide the data needed to design, adapt and evaluate NLP applications. Yet in some domains, such as medicine, confidentiality and ethical constraints make access to representative data complicated or even impossible. Still, a real need exists for this kind of corpus, both for training and for research. In this paper, we propose the CAS corpus, a French corpus containing clinical cases of real or fictitious patients. The cases cover various medical specialities and focus on different clinical situations. Currently, the corpus contains 3,600 cases (almost 1.3M word occurrences). The corpus comes with additional information (discussions of the clinical cases, keywords, etc.) and with annotations that we produced to address common research issues in this domain. We also present results from preliminary information retrieval and information extraction experiments performed on this corpus. These experiments can provide a baseline for researchers interested in working with these data.

KEYWORDS: Clinical corpus, clinical case, annotations, categorization, information extraction.
