Détection automatique de phrases en domaine de spécialité en français (Sentence boundary detection for specialized domains in French)

Sentence boundary detection is generally considered a solved problem. However, tools that perform well on standard text do not necessarily handle specialized corpora well, and segmentation errors can degrade the output of downstream natural language processing tools that expect sentence-segmented input. In this paper, we conduct a benchmark evaluation of 5 standard sentence boundary detection tools on 3 corpora covering different domains and subdomains. We then retrain one of the tools on domain-specific data and show that this leads to improved performance. In particular, we experiment with the clinical domain using a new clinical corpus annotated for gold-standard sentence boundaries. Sentence boundary detection with an OpenNLP model trained on the clinical data achieves an F-measure of 0.73, versus 0.66 for the standard OpenNLP distribution.

KEYWORDS: sentence segmentation, specialized domain, evaluation.
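To make the reported F-measure concrete, the following is a minimal sketch of how boundary-level precision, recall, and F-measure could be computed. It is not the evaluation script used in the paper: it assumes that sentence boundaries are represented as character offsets and that a predicted boundary counts as correct only on an exact match with a gold boundary. The function name `boundary_prf` and the sample offsets are hypothetical.

```python
def boundary_prf(gold, predicted):
    """Return (precision, recall, F-measure) for predicted sentence boundaries.

    Assumption: boundaries are character offsets, and a prediction is
    correct only if it exactly matches a gold-standard offset.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical offsets, for illustration only (not data from the paper).
gold_boundaries = [42, 97, 130, 188]
predicted_boundaries = [42, 60, 130, 188, 201]

p, r, f = boundary_prf(gold_boundaries, predicted_boundaries)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

In this toy case, three of the five predicted boundaries match the gold standard, so over-segmentation lowers precision while the one missed boundary lowers recall. A more lenient protocol (for example, tolerating small offset differences around whitespace) would need a different matching rule.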
