Using machine learning for predicting cervical cancer from Swedish electronic health records by mining hierarchical representations

Electronic health records (EHRs) contain rich documentation of disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction because of its high dimensionality, relative scarcity, and substantial noise. We investigated how best to represent EHR data for predicting cervical cancer, a serious disease for which early detection benefits the outcome of treatment. A case group of 1321 patients with cervical cancer was matched to ten times as many controls, and several types of events were extracted from the EHRs of both groups. These events included clinical codes, lab results, and the contents of free-text notes retrieved using an LSTM neural network. Clinical events are described with great variation in EHR texts, which leads to a very large feature space. Therefore, an event hierarchy inferred from the textual events was created to represent the clinical texts. Overall, the events extracted from free-text notes contributed the most to the final prediction, and the hierarchy of textual events further improved performance. Four classifiers were evaluated for predicting a future cancer diagnosis; Random Forest achieved the best results, with an AUC ranging from 0.70 one year before diagnosis to 0.97 one day before diagnosis. We conclude that our approach is sound and achieved excellent discrimination at diagnosis, but only modest discrimination before this point. Since our study objective was to predict the disease earlier than that, we propose that further work consider extending patient histories, e.g. by integrating primary health records that precede referral to hospital.
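To make the reported AUC values easier to interpret, the following is a minimal sketch of how AUC can be computed as the fraction of case/control pairs in which the case receives the higher risk score. The function name `pairwise_auc` and all scores are hypothetical and chosen only for illustration; they are not data or code from the study, which does not describe its evaluation code here.

```python
def pairwise_auc(case_scores, control_scores):
    """AUC as the share of case/control pairs ranked correctly.

    A pair counts as 1 when the case scores higher than the control,
    as 0.5 when the two scores tie, and as 0 otherwise.
    """
    correct = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                correct += 1.0
            elif case == control:
                correct += 0.5
    return correct / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks (illustration only, not study data).
# The 2:20 split mirrors the 1:10 case-to-control ratio of the abstract.
cases = [0.9, 0.4]
controls = [0.1, 0.2, 0.3, 0.5, 0.1, 0.2, 0.3, 0.6, 0.2, 0.1,
            0.1, 0.2, 0.3, 0.5, 0.1, 0.2, 0.3, 0.6, 0.2, 0.1]

auc = pairwise_auc(cases, controls)
print(f"AUC = {auc:.2f}")
```

Under this reading, an AUC of 0.70 means that a randomly chosen case outranks a randomly chosen control in about 70% of pairs, while 0.97 means the ranking is almost always correct. An AUC of 0.5 corresponds to chance-level ranking.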
