Predicting Decision-Making Time for Diagnosis over NGS Cycles: An Interpretable Machine Learning Approach

Motivation: In medical diagnosis, genome sequencing is commonly followed by computational analysis, which is generally performed only once the sequencing run has finished. In time-critical applications, however, it is crucial to start diagnosis as soon as sufficient evidence has accumulated. This research aims to establish a proof of principle for predicting an earlier decision-making time with a machine learning approach. The method is evaluated on Illumina sequencing cycles for pathogen diagnosis.

Results: We used a Long Short-Term Memory (LSTM) approach to predict the early decision-making time in time-critical clinical applications. We modeled the (meta-)information obtained from intermediate NGS cycles to investigate whether any changes are to be expected in the remaining sequencing cycles. We tested our model on different patient datasets and obtained an accuracy of over 98%, suggesting that the model does not depend on a particular dataset. Furthermore, acting on the early prediction results can save several hours of turnaround time. We used the SHapley Additive exPlanations (SHAP) framework to interpret and assess the LSTM classifier.

Availability: The source code is available at https://gitlab.com/dacs-hpi/ngs-biclass.

Contact: Bernhard.Renard@hpi.de
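SHAP attributions are based on Shapley values, which assign each input feature its average marginal contribution to a model's output over all feature coalitions. The following is a minimal sketch of that underlying idea only, not the authors' pipeline and not the SHAP library's API: it computes exact Shapley values by brute-force enumeration for a toy scoring function. The feature names (`mapped_reads`, `coverage`, `cycle_number`) and the function `toy_score` are hypothetical stand-ins for per-cycle meta-information and a trained classifier.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of features.

    value_fn takes a frozenset of "present" feature names and returns a score.
    Cost grows exponentially with the number of features, so this is only
    practical for small illustrative cases.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            # Shapley weight for a coalition of size k that excludes f
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to coalition s
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_score(present):
    """Hypothetical stand-in for a classifier output (assumption, not the LSTM)."""
    score = 0.0
    if "mapped_reads" in present:
        score += 0.5
    if "cycle_number" in present:
        score += 0.2
    # Interaction term: coverage only helps together with mapped_reads
    if "mapped_reads" in present and "coverage" in present:
        score += 0.2
    return score


features = ["mapped_reads", "coverage", "cycle_number"]
phi = shapley_values(toy_score, features)
for name in features:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy case the interaction term is split evenly between the two features involved, and the attributions sum to the difference between the score with all features and the score with none. Practical SHAP explainers approximate these values rather than enumerating all coalitions.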
