Coding nucleic acid sequences with graph convolutional network

Genome sequencing technologies produce a huge volume of genomic sequences. Neural network-based methods are prime candidates for extracting insights from these sequences because they apply well to large and diverse datasets. However, the highly variable lengths of nucleic acid sequences make it difficult to present them as input to a neural network, and genetic variations further complicate tasks that involve sequence comparison or alignment. Here, we propose a graph representation of nucleic acid sequences called gapped pattern graphs. These graphs can be transformed through a Graph Convolutional Network to form lower-dimensional embeddings for downstream tasks. On the basis of the gapped pattern graphs, we implemented a neural network model and demonstrated its performance in studying phage sequences. We compared our model with equivalent models based on other forms of input on four tasks related to nucleic acid sequences: discrimination between phages and integrative and conjugative elements (ICEs), phage integration site prediction, lifestyle prediction, and host prediction. We also compared it with other state-of-the-art tools where available. Our method consistently outperformed all the other methods across the evaluated metrics on all four tasks. In addition, our model was able to identify distinct gapped pattern signatures from the sequences.
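
To make the idea concrete, the following is a minimal sketch of how a sequence of any length might be turned into a small graph and passed through one graph convolution step. It is an illustration under stated assumptions, not the model described above: the edge definition (two k-mers separated by a fixed gap), the function names, the toy weights, and the mean pooling are all hypothetical, and the graph is symmetrized so that the common symmetric-normalization propagation rule can be used.

```python
from collections import Counter
from math import sqrt


def gapped_pattern_edges(seq, k=2, gap=1):
    """Count directed edges between two k-mers separated by `gap` bases.

    Assumed, illustrative definition of a gapped pattern; the original
    construction may differ.
    """
    edges = Counter()
    span = 2 * k + gap
    for i in range(len(seq) - span + 1):
        left = seq[i:i + k]
        right = seq[i + k + gap:i + span]
        edges[(left, right)] += 1
    return edges


def normalized_adjacency(edges):
    """Return the node list and D^-1/2 (A + I) D^-1/2 of the symmetrized graph."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    a = [[0.0] * n for _ in range(n)]
    for (u, v), weight in edges.items():
        a[index[u]][index[v]] += weight
        if u != v:
            a[index[v]][index[u]] += weight  # symmetrize (simplifying assumption)
    for i in range(n):
        a[i][i] += 1.0  # self-loops
    deg = [sum(row) for row in a]
    a_hat = [[a[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    return nodes, a_hat


def gcn_layer(a_hat, features, weight):
    """One propagation step: ReLU(A_hat @ H @ W), written with plain lists."""
    n, d_in, d_out = len(a_hat), len(weight), len(weight[0])
    agg = [[sum(a_hat[i][j] * features[j][c] for j in range(n))
            for c in range(d_in)] for i in range(n)]
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Toy usage: the graph size depends on the patterns present, not on sequence length.
edges = gapped_pattern_edges("ACGTACGT", k=2, gap=1)
nodes, a_hat = normalized_adjacency(edges)
n = len(nodes)
features = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # one-hot node features
weight = [[0.1 * (i + 1), -0.1 * (i + 1)] for i in range(n)]  # fixed toy weights, not trained
hidden = gcn_layer(a_hat, features, weight)
embedding = [sum(row[o] for row in hidden) / n for o in range(2)]  # mean-pool nodes
print(nodes, embedding)
```

Because the nodes are patterns rather than positions, sequences of different lengths map to graphs over the same node vocabulary, which is the property that makes a fixed-size embedding possible. A real model would use learned weights and a graph library rather than nested lists.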
