Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus

Clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been developed using popular sequence alignment tools and relatively simple clustering methods, followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been tried, yet with very limited success. This paper advocates an approach to clustering protein families that uses knowledge of protein functions to adjust the similarity scale shift parameter. A further source of external information is used when we reconstruct HPF evolutionary histories over an evolutionary tree: the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of clustering.
