Leveraging machine learning to identify acute myeloid leukemia patients and their chemotherapy regimens in an administrative database

BACKGROUND Administrative datasets are useful for identifying rare disease cohorts such as pediatric acute myeloid leukemia (AML). Previously, such cohorts were assembled through labor-intensive manual review of patients' longitudinal chemotherapy data.

METHODS We used a two-step machine learning (ML) method to (i) identify pediatric patients with newly diagnosed AML and (ii) identify the chemotherapy courses of those patients in an administrative/billing database. Using a study sample of 2558 patients who had previously been manually reviewed, we derived multiple ML algorithms from 75% of the sample and tested the selected model in the remaining hold-out sample. The selected model was also applied to assemble a new pediatric AML cohort and was further assessed in an external validation against a standalone cohort established by manual chart abstraction.

RESULTS For patient identification, the selected Support Vector Machine model yielded a sensitivity of 0.97 and a positive predictive value (PPV) of 0.97 in the hold-out test sample. For course-specific chemotherapy regimen and start date identification, the selected Random Forest model yielded an overall PPV greater than or equal to 0.88 and a sensitivity greater than or equal to 0.86 across all courses in the test sample. When applied to new cohort assembly, ML identified 3016 AML patients with 10,588 treatment courses. In the external validation subset, PPV was greater than or equal to 0.75 and sensitivity was greater than or equal to 0.82 for patient identification, and PPV was greater than or equal to 0.93 and sensitivity was greater than or equal to 0.94 for regimen identification.

CONCLUSION A carefully designed ML model can accurately identify pediatric AML patients and their chemotherapy courses from administrative databases. This approach may be generalizable to other diseases and databases.
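The sensitivity and PPV values reported above follow the standard definitions: sensitivity is the fraction of true AML patients that the model flags, and PPV is the fraction of flagged patients who truly have AML. The sketch below only illustrates how these two metrics are computed from manually reviewed labels and model predictions. The function name and the toy labels are hypothetical and are not taken from the study data or code.

```python
def sensitivity_and_ppv(y_true, y_pred):
    """Return (sensitivity, PPV) for binary labels, where 1 = AML and 0 = not AML."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed AML patients
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    # Guard against empty denominators (no true cases or no flagged cases).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Hypothetical toy data: manual-review labels vs. model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

sens, ppv = sensitivity_and_ppv(y_true, y_pred)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")  # sensitivity=0.80, PPV=0.80
```

In this toy example, 4 of 5 true cases are flagged and 4 of 5 flagged cases are true, so both metrics equal 0.80.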
