Domain Architecture Comparison for Multidomain Homology Identification

Homology identification is the first step in many genomic studies. Current methods, based on sequence comparison, can produce a substantial number of mis-assignments because homologous domains also occur in otherwise unrelated sequences. Here we propose methods to detect homologs through explicit comparison of protein domain content. We develop several schemes, based on methods from the field of information retrieval, for scoring the homology of a pair of protein sequences. We evaluate the proposed methods alongside methods from the literature using a benchmark of fifteen sequence families of known evolutionary history. The results demonstrate the effectiveness of comparing domain architectures using these similarity measures. We also demonstrate the importance both of weighting promiscuous domains and of compensating for the statistical effect of having a large number of domains in a protein. Using logistic regression, we demonstrate the benefit of combining similarity measures based on domain content with sequence similarity measures.
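
The abstract does not spell out the scoring schemes, so the following is only a minimal sketch of one information-retrieval-style measure that fits the description: domains are down-weighted by an inverse-document-frequency term (so promiscuous domains count less), and the score is a cosine similarity (so proteins with many domains are not favored simply for their size). The function names, the choice of IDF with cosine normalization, and the toy architectures are all assumptions made for illustration, not the authors' method or data.

```python
import math
from collections import Counter


def idf_weights(architectures):
    """Assumed weighting: log(N / number of proteins containing the domain).

    Domains that occur in many proteins (promiscuous domains) get low weight.
    """
    n = len(architectures)
    doc_freq = Counter(d for arch in architectures for d in set(arch))
    return {d: math.log(n / df) for d, df in doc_freq.items()}


def weighted_cosine(arch_a, arch_b, weights):
    """Cosine similarity between two domain-count vectors after weighting.

    Dividing by the vector norms compensates for proteins that simply
    contain a large number of domains.
    """
    vec_a = {d: c * weights.get(d, 0.0) for d, c in Counter(arch_a).items()}
    vec_b = {d: c * weights.get(d, 0.0) for d, c in Counter(arch_b).items()}
    dot = sum(v * vec_b.get(d, 0.0) for d, v in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy architectures (lists of domain names), not benchmark data.
proteins = {
    "p1": ["Kinase", "SH2", "SH3"],
    "p2": ["Kinase", "SH2"],
    "p3": ["SH3", "PDZ"],
    "p4": ["SH3", "C2"],
}
weights = idf_weights(list(proteins.values()))

# p1 and p2 share two relatively rare domains; p3 and p4 share only SH3,
# the most widespread domain in this toy set, so their score is lower.
score_12 = weighted_cosine(proteins["p1"], proteins["p2"], weights)
score_34 = weighted_cosine(proteins["p3"], proteins["p4"], weights)
print(score_12 > score_34)
```

In this sketch the promiscuity correction and the domain-count correction are separate steps, which makes it easy to switch either one off and see its effect. A score of this kind could then serve as one input feature, next to a sequence similarity score, in a logistic regression classifier of the sort the abstract mentions.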
