Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence

Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, annotation attempts fail. Here we address the fundamental question of domain identification for highly divergent proteins. Using high-performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved across all species but may be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and diverse panel of homologous sequences; (2) a decision-making protocol combines the outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction across several datasets and statistical hypothesis tests. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison, respectively. Because they are close to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the widely used consensus models. Using them, we improved the annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins, against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions over the whole genome.
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
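The three steps listed in the abstract can be illustrated with a toy sketch. This is not the CLADE implementation: the `Hit` record, the domain names (`PF_A`, `PF_B`, `PF_C`), the additive scoring, the `bonus` value, and the brute-force search are all assumptions made for illustration. The sketch assumes that hits from several clade-centered models have already been computed, keeps the best-scoring hit per domain and region, and then picks the non-overlapping set of hits whose summed score, plus a bonus for known co-occurring domain pairs, is highest.

```python
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one model hit on a protein sequence."""
    domain: str   # domain family the model belongs to
    model: str    # which clade-centered (or consensus) model produced the hit
    start: int    # inclusive start position on the protein
    end: int      # inclusive end position on the protein
    score: float  # higher is better (assumed additive)


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one sequence position."""
    return a.start <= b.end and b.start <= a.end


def merge_model_hits(hits: list[Hit]) -> list[Hit]:
    """Keep, for each domain and region, only the best-scoring hit
    among the models of that domain (a stand-in for the decision step)."""
    kept: list[Hit] = []
    for h in sorted(hits, key=lambda h: -h.score):
        if not any(h.domain == k.domain and overlaps(h, k) for k in kept):
            kept.append(h)
    return kept


def best_architecture(hits, cooccurring, bonus=5.0):
    """Exhaustively search non-overlapping subsets of hits and return the
    one maximizing summed score plus a bonus per co-occurring domain pair.
    Exponential in len(hits); only suitable for small toy inputs."""
    best_set, best_score = (), 0.0
    for r in range(1, len(hits) + 1):
        for subset in combinations(hits, r):
            pairs = list(combinations(subset, 2))
            if any(overlaps(a, b) for a, b in pairs):
                continue
            score = sum(h.score for h in subset)
            score += bonus * sum(
                frozenset((a.domain, b.domain)) in cooccurring for a, b in pairs
            )
            if score > best_score:
                best_set, best_score = subset, score
    return sorted(best_set, key=lambda h: h.start), best_score


# Toy input: two models hit the same PF_A region; PF_B overlaps both neighbours.
hits = [
    Hit("PF_A", "consensus", 10, 100, 20.0),
    Hit("PF_A", "clade1", 12, 105, 35.0),
    Hit("PF_B", "clade2", 90, 180, 30.0),
    Hit("PF_C", "clade3", 110, 200, 28.0),
]
known_pairs = {frozenset(("PF_A", "PF_C"))}

merged = merge_model_hits(hits)
architecture, total = best_architecture(merged, known_pairs)
print([h.domain for h in architecture], total)
```

In this toy input the clade-centered hit for `PF_A` replaces the weaker consensus hit, and the co-occurrence bonus favors the `PF_A` plus `PF_C` architecture over any set containing the overlapping `PF_B` hit. A real system would need a scalable search and calibrated scores rather than this exhaustive enumeration.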
