The automatic detection of scientific terms in patient information

Despite legislative efforts to improve the readability of patient information, several surveys have shown that respondents still feel distressed when reading this information, or even consider it fully incomprehensible. This paper deals with one source of that distress: the use of scientific terminology in patient information. To assess the scale of the problem, we collected a Dutch-English parallel corpus of European Public Assessment Reports (EPARs), which was annotated by two annotators. This corpus was used to train and evaluate an automatic approach to scientific term detection. We investigated a lexicon-based approach and a learning-based approach that relies only on text-internal clues. Finally, both approaches were combined in an optimized hybrid learning-based term extraction experiment. We show that although the lexicon-based approach yields high precision on the detection of scientific terms, its coverage remains limited. The learning-based approach, on the other hand, achieves an F-score of 80% and remains quite robust despite the highly skewed data set.
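To make the contrast between high precision and limited coverage concrete, the following is a minimal sketch of lexicon-based term detection scored with precision, recall and F-score. It is not the system described in this paper: the lexicon, the example sentence, the gold labels and the function names `detect_terms` and `precision_recall_f1` are all hypothetical and chosen only for illustration.

```python
def detect_terms(tokens, lexicon):
    """Flag a token as a scientific term when its lowercase form is in the lexicon."""
    return [tok.lower() in lexicon for tok in tokens]


def precision_recall_f1(predicted, gold):
    """Token-level precision, recall and F-score for boolean term labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical toy lexicon and sentence (assumptions, not data from the corpus).
LEXICON = {"hypertension", "renal"}
tokens = ["Patients", "with", "hypertension", "or", "hepatic", "impairment"]
# Hypothetical gold annotation: "hypertension" and "hepatic" are scientific terms.
gold = [False, False, True, False, True, False]

predicted = detect_terms(tokens, LEXICON)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case every term the lexicon flags is correct, but "hepatic" is missed because it is absent from the lexicon, so precision is perfect while recall is only one half. This is the kind of coverage gap that a learning-based component is meant to close.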
