Pandit: a database of protein and associated nucleotide domains with inferred trees

MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution.

RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes three alignments: (1) an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment; (2) an alignment of DNA sequences that contain the coding sequence of the Pfam alignment, where these could be recovered (overall, 82.9% of the sequences taken from Pfam); and (3) the amino acid alignment restricted to only those sequences for which a DNA sequence could be recovered. Each alignment has an associated estimate of its phylogenetic tree. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.
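As an illustration of the topology step mentioned above, the following is a minimal sketch of textbook neighbor joining (Saitou and Nei, 1987) applied to a precomputed distance matrix. It is not the Pandit pipeline: the function name `neighbor_joining`, the edge-list output, and the toy distances are assumptions made for this example, the distances here are made up rather than maximum likelihood estimates, and the branch lengths it returns are the usual neighbor joining values, not the maximum likelihood branch lengths reported in Pandit.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns a list of (child, parent, branch_length) edges; the last
    edge connects the two nodes remaining at the end.
    """
    d = dict(dist)          # working copy, extended with internal nodes
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each active node to all other active nodes.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"node{next_id}"
        next_id += 1
        edges += [(i, u, li), (j, u, lj)]
        # Distances from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Hypothetical toy distances (not taken from Pandit).
toy_names = ["a", "b", "c", "d", "e"]
toy_pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_dist = {frozenset(p): float(v) for p, v in toy_pairs.items()}

for child, parent, length in neighbor_joining(toy_names, toy_dist):
    print(f"{child} -- {parent}: {length:.2f}")
```

In a workflow like the one described, the topology from such a procedure would be held fixed and the branch lengths re-estimated by maximum likelihood under a substitution model; that step is not shown here.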

[1] S. Whelan, et al. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular biology and evolution, 2001.

[2] Rolf Apweiler, et al. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 2000.

[3] Rolf Apweiler, et al. The EBI SRS server-new features. Bioinform., 2002.

[4] Z. Yang, et al. Models of amino acid substitution and applications to mitochondrial protein evolution. Molecular biology and evolution, 1998.

[5] M. O. Dayhoff, et al. A Model of Evolutionary Change in Proteins. 1978.

[6] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 2005.

[7] Simon Whelan, et al. A novel use of equilibrium frequencies in models of sequence evolution. Molecular biology and evolution, 2002.

[8] P. Liò, et al. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in genetics: TIG, 2001.

[9] M. Kimura. The Neutral Theory of Evolution. 1983.

[10] T. L. Blundell, et al. Sequence-structure homology recognition by iterative alignment refinement and comparative modeling. Proteins, 2001.

[11] M. Gouy, et al. HOBACGEN: database system for comparative genomics in bacteria. Genome research, 2000.

[12] Ziheng Yang, et al. Statistical methods for detecting molecular adaptation. Trends in Ecology & Evolution, 2000.

[13] Erik L. L. Sonnhammer, et al. NIFAS: visual analysis of domain evolution in proteins. Bioinform., 2001.

[14] Alex Bateman, et al. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinform., 2002.

[15] Ziheng Yang, et al. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci., 1997.

[16] Shashi B. Pandit, et al. SUPFAM - a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes. Nucleic Acids Res., 2002.

[17] P. Waddell, et al. Plastid Genome Phylogeny and a Model of Amino Acid Substitution for Proteins Encoded by Chloroplast DNA. Journal of Molecular Evolution, 2000.

[18] N. Saitou, et al. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 1987.

[19] Rolf Apweiler, et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res., 1997.

[20] William R. Taylor, et al. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 1992.

[21] N. Goldman, et al. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 2000.

[22] N. Goldman, et al. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Molecular biology and evolution, 1994.

[23] Baldomero Oliva, et al. Structural similarity to link sequence space: New potential superfamilies and implications for structural genomics. Protein science: a publication of the Protein Society, 2002.

[24] John P. Overington, et al. HOMSTRAD: A database of protein structure alignments for homologous families. Protein science: a publication of the Protein Society, 1998.

[25] Jiye Shi, et al. HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families. Bioinform., 2001.

[26] S. Balaji, et al. PALI - a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res., 2001.

[27] Nick Goldman, et al. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. 1995.

[28] H. Kishino, et al. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 2005.

[29] Jonathan P. Bollback, et al. Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology. Science, 2001.

[30] Robert D. Finn, et al. The Pfam protein families database. Nucleic Acids Res., 2004.

[31] M. Gouy, et al. HOVERGEN: a database of homologous vertebrate genes. Nucleic acids research, 1994.

[32] Rodrigo Lopez, et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 1999.

[33] Sean R. Eddy, et al. ATV: display and manipulation of annotated phylogenetic. Bioinform., 2001.

[34] K. Holsinger, et al. The effect of topology on estimates of among-site rate variation. Journal of Molecular Evolution, 1996.