Information Extraction for Clinical Data Mining: A Mammography Case Study

Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon, the Breast Imaging Reporting and Data System (BI-RADS), to describe and report their findings, and mammography records are then stored in a well-defined database format, the National Mammography Database (NMD) format. Recently, researchers have applied data mining and machine learning techniques to these databases and built breast cancer classifiers that can help in the early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format, consist solely of text reports, or both. Extracting BI-RADS features from free text and checking recorded predictive variables for consistency with the text reports are therefore crucial to addressing this problem. We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS feature extraction algorithm for clinical data mining. The algorithm consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts’ input to parse sentences and detect BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text while filtering out ultrasound concepts. On our dataset, the algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.
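
The three-stage design described above (sentence splitting, concept finding, negation checking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the semantic grammar is replaced here by simple regular expressions, and the tiny lexicon, the negation cues, the ultrasound filter and the function names `split_sentences` and `extract` are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical mini-lexicon (concept name -> trigger pattern). The real
# BI-RADS lexicon and the semantic grammar it drives are far richer.
LEXICON = {
    "mass": re.compile(r"\bmass(?:es)?\b"),
    "calcification": re.compile(r"\b(?:micro)?calcifications?\b"),
    "architectural_distortion": re.compile(r"\barchitectural distortion\b"),
}

# Assumed negation cues, searched in the text preceding a concept mention.
NEGATION = re.compile(r"\b(?:no evidence of|negative for|without|not|no)\b")

# Assumed cue for sentences that report ultrasound rather than mammography.
ULTRASOUND = re.compile(r"\b(?:ultrasound|sonograph\w*)\b")


def split_sentences(report):
    """Syntax-analyzer stand-in: split on sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", report)
    return [p.strip().lower() for p in parts if p.strip()]


def extract(report):
    """Return {concept: is_present}; a positive mention overrides a negated one."""
    found = {}
    for sentence in split_sentences(report):
        if ULTRASOUND.search(sentence):
            continue  # filter out ultrasound findings
        for concept, pattern in LEXICON.items():
            for match in pattern.finditer(sentence):
                # Lexical scan of the text before the concept for a negation cue.
                negated = bool(NEGATION.search(sentence[:match.start()]))
                found[concept] = found.get(concept, False) or not negated
    return found


report = (
    "There is a spiculated mass in the left breast. "
    "No suspicious calcifications are seen. "
    "Ultrasound shows architectural distortion."
)
print(extract(report))
```

In this sketch a concept counts as negated if any cue appears earlier in the same sentence, which is a deliberately crude rule; a real negation detector would limit the scope of each cue.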
