Classification-based scientific term detection in patient information

Although intended for the “average layman” in terms of both readability and content, current patient information still contains many scientific terms. Several studies have concluded that the use of scientific terminology is one of the factors that most strongly influence the readability of this patient information. The present study addresses the automatic recognition of overly scientific terminology as a first step towards replacing the recognized scientific terms with their popular counterparts. To this end, we experimented with two approaches: a dictionary-based approach and a learning-based approach trained on a rich feature vector. The research was conducted on a bilingual corpus of English and Dutch EPARs (European Public Assessment Reports). Our results show that we can extract scientific terms with high accuracy (> 80%, 10% below human performance) for both languages. Furthermore, we show that a lexicon-independent approach, which relies solely on orthographical and morphological information, is the most powerful predictor of the scientific character of a given term.
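To make the idea of a lexicon-independent feature vector more concrete, the following is a minimal sketch of how orthographical and morphological cues might be read off a single token. It is not the feature set used in the study: the function name `surface_features`, the individual features, and the two affix lists are assumptions chosen purely for illustration.

```python
# Illustrative only: these affix lists are hypothetical and are not taken
# from the study; a real system would derive or curate them per language.
NEOCLASSICAL_PREFIXES = ("hyper", "hypo", "cardio", "neuro", "hepat")
NEOCLASSICAL_SUFFIXES = ("itis", "osis", "emia", "pathy", "ectomy")


def surface_features(token: str) -> dict:
    """Return simple orthographic and morphological cues for one token.

    No dictionary or lexicon lookup is involved; every feature is computed
    from the character string alone.
    """
    lower = token.lower()
    return {
        # Orthographic cues
        "length": len(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "is_capitalized": token[:1].isupper(),
        # Morphological cues: first matching affix, or "" if none matches
        "prefix": next((p for p in NEOCLASSICAL_PREFIXES if lower.startswith(p)), ""),
        "suffix": next((s for s in NEOCLASSICAL_SUFFIXES if lower.endswith(s)), ""),
        # Final character trigram as a crude stand-in for word-ending shape
        "last3": lower[-3:],
    }


if __name__ == "__main__":
    for word in ("hepatitis", "Hypoglycemia", "headache"):
        print(word, surface_features(word))
```

A dictionary of this kind could be passed to any classifier that accepts symbolic features. Which learner and which features work best is an empirical question that this sketch does not address.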
