Detecting cryptically simple protein sequences using the SIMPLE algorithm

MOTIVATION: Low-complexity or cryptically simple sequences are widespread in proteins, but their evolution and function are poorly understood. To date, methods for detecting low complexity in proteins have been directed towards filtering such regions out prior to sequence homology searches, not towards the analysis of the regions per se. However, many of these regions are encoded by non-repetitive DNA sequences and may therefore result from selection acting on protein structure and/or function.

RESULTS: We have developed a new tool, based on the SIMPLE algorithm, that facilitates quantification of the amount of simple sequence in proteins and identifies the types of short motifs that show clustering above a certain threshold. By modifying the sensitivity of the program, simple sequence content can be studied at various levels, from highly organised tandem structures to complex combinations of repeats. We compare the relative amount of simplicity in different functional groups of yeast proteins and determine the level of clustering of the different amino acids in these proteins.

AVAILABILITY: The program is available on request or online at http://www.biochem.ucl.ac.uk/bsm/SIMPLE.
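The idea of scoring short motifs by how strongly they cluster, and comparing the total against randomised sequences, can be illustrated with a minimal Python sketch. This is not the SIMPLE program and does not reproduce its scoring scheme: the function names `motif_scores` and `simplicity_factor`, the motif length `k`, the window size, and the use of shuffled sequences as the baseline are all assumptions made here for illustration.

```python
import random


def motif_scores(seq, k=2, window=10):
    """Score each position by the number of other occurrences of the k-mer
    starting there within `window` start positions on either side.
    (Illustrative scoring only; k and window are assumed values.)"""
    n = len(seq)
    scores = []
    for i in range(n - k + 1):
        motif = seq[i:i + k]
        lo = max(0, i - window)
        hi = min(n - k, i + window)
        count = 0
        for j in range(lo, hi + 1):
            if j != i and seq[j:j + k] == motif:
                count += 1
        scores.append(count)
    return scores


def simplicity_factor(seq, k=2, window=10, n_shuffles=100, seed=0):
    """Ratio of the observed total score to the mean total score of shuffled
    sequences with the same composition. Values above 1 suggest that motifs
    are more clustered than composition alone would predict."""
    rng = random.Random(seed)
    observed = sum(motif_scores(seq, k, window))
    residues = list(seq)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        total += sum(motif_scores("".join(residues), k, window))
    mean_shuffled = total / n_shuffles
    if mean_shuffled == 0:
        # No repeats in any shuffle: define the ratio by the observed score.
        return float("inf") if observed else 1.0
    return observed / mean_shuffled


if __name__ == "__main__":
    # Hypothetical sequence: a polyglutamine tract followed by mixed residues.
    protein = "QQQQQQQQQQ" + "ACDEFGHIKLMNPRSTVWY"
    print(motif_scores("QAQAQA", k=2))  # [2, 1, 2, 1, 2]
    print(simplicity_factor(protein, k=2))
```

Raising `k` or narrowing the window makes the sketch respond mainly to tight tandem repeats, while `k=1` with a wide window picks up looser clustering of single amino acids. That is one simple way to mimic the adjustable sensitivity described above.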
