Genome Sequence Classification for Animal Diagnostics with Graph Representations and Deep Neural Networks

Bovine Respiratory Disease Complex (BRDC) is a respiratory disease in cattle with multiple etiologies, including bacterial and viral pathogens. Mortality, morbidity, therapy, and quarantine resulting from BRDC are estimated to account for significant losses in the cattle industry, so early detection and management are crucial to mitigating economic losses. Current animal disease diagnostics rely on traditional tests such as bacterial culture, serology, and Polymerase Chain Reaction (PCR). Although these tests are validated for several diseases, their main limitation is that they have little ability to detect multiple pathogens simultaneously. Advances in data analytics and machine learning applied to metagenome sequencing are opening new possibilities for diagnostics. In this work, we demonstrate a machine learning approach that identifies pathogen signatures in bovine metagenome sequences using k-mer-based network embedding followed by deep learning-based classification. In experiments on two different simulated datasets, we show that network-based machine learning approaches can detect pathogen signatures with up to 89.7% accuracy. We will make the data available upon request to help tackle this important problem in a difficult domain.
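
To make the idea of a k-mer-based network concrete, the following is a minimal sketch of one common way to turn a sequence into a graph: each distinct k-mer becomes a node, and consecutive overlapping k-mers are joined by a weighted directed edge. The abstract does not specify the exact graph construction, embedding method, or classifier, so the function name `kmer_graph`, the choice of `k`, and the edge-weighting scheme are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def kmer_graph(sequence, k=3):
    """Build a weighted directed k-mer graph from one sequence.

    Hypothetical helper for illustration only: nodes are k-mers, and an
    edge (a, b) counts how often k-mer b immediately follows k-mer a
    (the two overlap by k - 1 bases).
    """
    # Slide a window of width k across the sequence.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Count transitions between each k-mer and its successor.
    return Counter(zip(kmers, kmers[1:]))


if __name__ == "__main__":
    # Toy read; real inputs would be metagenome reads.
    for (src, dst), weight in kmer_graph("ATGATGA", k=3).items():
        print(f"{src} -> {dst} (weight {weight})")
```

A graph built this way could then be passed to a graph embedding method to obtain a fixed-length vector per sequence, which a neural network classifier would take as input. Those later stages are omitted here because the abstract gives no details about them.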
