CAS: French Corpus with Clinical Cases

Textual corpora are essential for many NLP applications: they provide the information needed to build, tune, and test these applications and the corresponding tools. They are also crucial for designing reliable methods and obtaining reproducible results. Yet in some areas, such as the medical area, confidentiality or ethical reasons make it difficult or even impossible to access textual data representative of those produced in these areas. We propose the CAS corpus, built from clinical cases as they are reported in the published scientific literature in French. We describe this corpus, which currently contains over 397,000 word occurrences, and its existing linguistic and semantic annotations.
