The SYSTERS protein family database: Taxon-related protein family size distributions and singleton frequencies

Based on the SYSTERS protein family database, we present taxon-related protein family frequencies and size distributions. A set of taxon-related protein families is the subset of all families that contain sequences from a given taxon, where the taxon is not restricted to the species level but may be of any rank in the taxonomy. We examine eight ranks in the lineages of seven organisms. We observe a strong linear correlation between the total number of distinct families and the number of sequences in the data set under consideration. We fitted the generalised power-law function to the protein family size distributions in a least-squares sense, excluding singleton frequencies. Taxon-related family distributions tend to have the same shape, with a negative slope no larger than -2.1 for large data sets; for smaller data sets, the slope decreases to as low as -3.7. Slopes of family distributions increase slowly towards higher taxonomic ranks. These observations lead to a new estimate of the frequency of single-sequence clusters (singletons). We also examine data sets of various species with respect to whether they are complete or incomplete.
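The fitting step described above can be illustrated with a minimal sketch. The abstract does not state the functional form, so the code assumes a generalised power law of the shape f(x) = c (x + b)^(-a), where x is the family size and f(x) the number of families of that size. It fits in log space with a simple grid search over b; this is an illustrative choice and not necessarily the procedure used by the authors. The function name `fit_generalised_power_law` and the parameter names `a`, `b`, `c` are hypothetical.

```python
import numpy as np


def fit_generalised_power_law(sizes, counts, b_grid=None):
    """Fit counts ~ c * (sizes + b)**(-a) by least squares in log space.

    Assumed functional form (not given in the abstract). Singletons
    (family size 1) are excluded from the fit, as are empty size classes.
    Returns (c, a, b); the log-log slope for large sizes is -a.
    """
    sizes = np.asarray(sizes, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Drop singletons and size classes with no families (log undefined).
    keep = (sizes > 1) & (counts > 0)
    x = sizes[keep]
    y = np.log(counts[keep])

    if b_grid is None:
        # Hypothetical search range for the shift parameter b.
        b_grid = np.linspace(0.0, 5.0, 501)

    best = None
    for b in b_grid:
        # For fixed b the model is linear: log f = log c - a * log(x + b).
        design = np.column_stack([np.ones_like(x), np.log(x + b)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(np.exp(coef[0])), float(-coef[1]), float(b))

    _, c, a, b = best
    return c, a, b


if __name__ == "__main__":
    # Synthetic family size distribution; the singleton count is set to an
    # arbitrary value to show that it does not influence the fit.
    sizes = np.arange(1, 201)
    counts = 5000.0 * (sizes + 1.0) ** (-2.1)
    counts[0] = 99999.0
    c, a, b = fit_generalised_power_law(sizes, counts)
    print(f"c = {c:.1f}, slope = {-a:.2f}, b = {b:.2f}")
```

Because the singleton class is left out of the fit, evaluating the fitted curve at x = 1 gives one possible model-based value to compare with the observed singleton count. Whether this corresponds to the estimate mentioned in the abstract is an assumption.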
