Powerful Sequence Similarity Search Methods and In-Depth Manual Analyses Can Identify Remote Homologs in Many Apparently “Orphan” Viral Proteins

ABSTRACT The genome sequences of new viruses often contain many “orphan” or “taxon-specific” proteins that apparently lack homologs. However, because viral proteins evolve very rapidly, commonly used sequence similarity detection methods such as BLAST may overlook homologs. We analyzed a data set of proteins from RNA viruses characterized as “genus specific” by BLAST. More powerful, recently developed methods such as HHblits and HHpred (available through user-friendly web interfaces) detected distant homologs of a quarter of these proteins, suggesting that these methods should be used to annotate viral genomes. In-depth manual analyses of a subset of the remaining sequences, guided by contextual information such as taxonomy, gene order, or domain cooccurrence, identified distant homologs of another third. Thus, a combination of powerful automated methods and manual analyses can uncover distant homologs of many proteins thought to be orphans. We expect these methodological results to also apply to cellular organisms, since they generally evolve much more slowly than RNA viruses. As an application, we reanalyzed the genome of a bee pathogen, Chronic bee paralysis virus (CBPV). We identified homologs of most of its proteins thought to be orphans; in each case, identifying homologs provided functional clues. We discovered that CBPV encodes a domain homologous to the Alphavirus methyltransferase-guanylyltransferase; a putative membrane protein, SP24, with homologs in unrelated insect viruses and in insect-transmitted plant viruses with different morphologies (cileviruses, higreviruses, blunerviruses, negeviruses); and a putative virion glycoprotein, ORF2, also found in negeviruses. SP24 and ORF2 are probably major structural components of the virions.
