Natural language processing: an introduction

OBJECTIVES: To provide an overview and tutorial of natural language processing (NLP) and modern NLP-system design.

TARGET AUDIENCE: This tutorial targets the medical informatics generalist who has limited acquaintance with the principles behind NLP and/or limited knowledge of the current state of the art.

SCOPE: We describe the historical evolution of NLP, and summarize common NLP sub-problems in this extensive field. We then provide a synopsis of selected highlights of medical NLP efforts. After providing a brief description of common machine-learning approaches that are being used for diverse NLP sub-problems, we discuss how modern NLP architectures are designed, with a summary of the Apache Foundation's Unstructured Information Management Architecture. We finally consider possible future directions for NLP, and reflect on the possible impact of IBM Watson on the medical field.
