The relationship between classification of multi-domain proteins using an alignment-free approach and their functions: a case study with immunoglobulins.

Establishing functional relationships between multi-domain protein sequences is a non-trivial task. Traditionally, inferring the functions of proteins and the relationships among them requires domain assignment as a prerequisite. This process is sensitive to alignment quality and domain definitions, and for multi-domain proteins the alignments are often poor for several reasons. We report the correspondence between the classification of proteins represented as full-length gene products and their functions. Our approach differs fundamentally from traditional methods in that classification is not performed at the level of domains. The method is based on alignment-free computation of local matching scores (LMS) at the amino-acid sequence level, followed by hierarchical clustering. As there is no gold standard for full-length protein sequence classification, we used Gene Ontology and domain-architecture based similarity measures to assess our classification. The final clusters obtained using LMS show high functional and domain-architectural similarity. In comparisons with alignment-based approaches at both the domain and full-length protein levels, LMS performed better. Using this method we have recreated objective relationships among different protein kinase sub-families and have also classified immunoglobulin-containing proteins, for which sub-family definitions do not currently exist. The method can be applied to any set of protein sequences and should therefore be useful in the analysis of large numbers of full-length protein sequences.
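The two-stage idea described above (an alignment-free pairwise score on full-length sequences, followed by hierarchical clustering) can be illustrated with a minimal sketch. The abstract does not specify how LMS is computed, so the sketch substitutes a plainly different and much simpler measure: the Jaccard distance between sets of overlapping k-mers. The function names, the choice of k = 3 and the use of average linkage are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in an amino-acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Jaccard distance between k-mer sets.

    This is a stand-in for an LMS-derived distance, NOT the LMS measure itself.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def average_linkage(seqs, n_clusters, k=3):
    """Agglomerative clustering of full-length sequences (average linkage).

    Returns a sorted list of clusters, each a sorted list of sequence indices.
    """
    # Pairwise distances are computed once, without any alignment step.
    dist = {(i, j): kmer_distance(seqs[i], seqs[j], k)
            for i, j in combinations(range(len(seqs)), 2)}
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the closest pair until n_clusters remain.
    while len(clusters) > n_clusters:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # y > x, so index x is unaffected
    return sorted(clusters)
```

Because distances are computed on whole sequences, no domain boundaries or multiple alignments are needed before clustering, which is the property the abstract emphasises. A real analysis would replace `kmer_distance` with the actual LMS-based distance.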
