Automatic population of structured reports from narrative pathology reports

This project uses natural language processing to extract pertinent information from free-text pathology reports and automatically populate structured reports. We developed a processing pipeline consisting of a supervised machine-learning approach, using Conditional Random Fields for medical entity recognition, combined with rule-based methods. In total, 477 narrative pathology reports of primary cutaneous melanomas were collected for evaluation. Evaluations on the training set show that refining the rules can improve system performance by about 8.7%. In end-to-end evaluation on the test set, the overall micro-averaged precision, recall and F-score are 89.44%, 80.60% and 84.79%, respectively. Our study indicates that this approach to automatically populating structured templates from narrative reports is feasible, with promising results. Error analysis reveals that single-specimen reports with standard headings and simple, concise statements are significantly associated with correct population. In conclusion, the system can improve pathology reporting and support data mining for cancer registries, clinical audits and epidemiology research.
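As a minimal sketch of how micro-averaged scores of this kind are usually computed, the Python below pools true positives, false positives and false negatives across all template fields before taking the ratios. The `FieldCounts` type, the function names and the example counts are hypothetical illustrations, not taken from the study. Only the reported precision and recall figures come from the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class FieldCounts:
    """Per-field tallies (hypothetical structure, for illustration only)."""
    tp: int  # correctly populated values
    fp: int  # populated values that are wrong or spurious
    fn: int  # gold-standard values the system missed


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts: Iterable[FieldCounts]) -> Tuple[float, float, float]:
    """Pool the counts over all fields, then compute P, R and F once."""
    counts = list(counts)
    tp = sum(c.tp for c in counts)
    fp = sum(c.fp for c in counts)
    fn = sum(c.fn for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


# Invented counts for two fields; these are NOT results from the paper.
example = [FieldCounts(tp=8, fp=2, fn=2), FieldCounts(tp=1, fp=0, fn=3)]
p, r, f = micro_average(example)

# Consistency check on the figures reported in the abstract.
reported_f = round(f_score(89.44, 80.60), 2)  # 84.79
```

Because the counts are pooled before dividing, fields with many instances weigh more heavily than rare ones. The reported F-score of 84.79% is consistent with the harmonic mean of the reported precision and recall.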
