PhANNs, a fast and accurate tool and web server to classify phage structural proteins

For any given bacteriophage genome, or for phage sequences in metagenomic data sets, we are unable to assign a function to 50-90% of genes. Structural protein-encoding genes constitute a large fraction of the average phage genome and are among the most divergent genes, which makes them among the hardest to identify with homology-based methods. To understand the functions encoded by phages and their contributions to their environments, and to help gauge their utility as potential phage therapy agents, we have developed a new approach that classifies phage ORFs into ten major classes of structural proteins or into an “other” category. The resulting tool is named PhANNs (Phage Artificial Neural Networks). We built a database of 538,213 manually curated phage protein sequences and split it into eleven subsets (ten for cross-validation, one for testing) using a novel clustering method that ensures no homologous proteins are shared between sets while maintaining the maximum sequence diversity for training. An Artificial Neural Network ensemble trained on features extracted from those sets reached a test F1-score of 0.875 and a test accuracy of 86.2%. PhANNs can rapidly classify proteins into one of the ten structural classes, and non-phage proteins are classified as “other”, providing a new approach for the functional annotation of phage proteins. PhANNs is open source and can be run from our web server or installed locally.

Author Summary

Bacteriophages (phages, viruses that infect bacteria) are the most abundant biological entities on Earth. They outnumber bacteria by a factor of ten. Because phages differ greatly from one another and from bacteria, and because comparatively few phage genes are present in our databases, we are unable to assign a function to 50-90% of phage genes. In this work, we developed PhANNs, a machine learning tool that can classify a phage gene into one of ten structural roles or as “other”. This approach does not require a similar gene to be already known.
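The constraint that no homologous proteins are shared between the eleven subsets can be illustrated with a minimal sketch. It assumes each protein has already been assigned a cluster identifier by some sequence-clustering step, and it then places whole clusters into subsets. The function name `split_by_cluster`, the toy identifiers, and the greedy size-balancing rule are hypothetical choices made for illustration; this is not the authors' clustering method, and it does not attempt to reproduce their handling of sequence diversity.

```python
from collections import defaultdict


def split_by_cluster(cluster_of, n_sets=11):
    """Assign whole clusters to subsets so that no cluster spans two subsets.

    cluster_of: dict mapping protein ID -> cluster ID (assumed precomputed).
    Returns a list of n_sets lists of protein IDs.
    """
    # Group protein IDs by their cluster.
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    subsets = [[] for _ in range(n_sets)]
    # Place the largest clusters first, each into the currently smallest
    # subset (a simple greedy heuristic to keep subset sizes similar).
    ordered = sorted(members, key=lambda c: (-len(members[c]), str(c)))
    for cluster_id in ordered:
        target = min(range(n_sets), key=lambda i: len(subsets[i]))
        subsets[target].extend(sorted(members[cluster_id]))
    return subsets


# Toy data (hypothetical): 100 proteins spread over 25 clusters.
cluster_of = {f"prot{i}": f"cluster{i % 25}" for i in range(100)}
subsets = split_by_cluster(cluster_of)

# Ten subsets serve as cross-validation folds; one is held out for testing.
cv_folds, test_set = subsets[:-1], subsets[-1]
```

Because every member of a cluster lands in the same subset, a protein in the held-out test set never has a clustered relative in any cross-validation fold, which is the property the abstract relies on when reporting test performance.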
