Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins

Background: Amino acid repeats (AARs) are common features of protein sequences. They often evolve rapidly and are involved in a number of human diseases. They also show significant associations with particular Gene Ontology (GO) functional categories, particularly transcription, suggesting that they play some role in protein function. It has recently been suggested that AARs play a significant role in the evolution of intrinsically unstructured regions (IURs) of proteins. We investigate the relationship between the frequency and evolution of AARs and their localization within proteins, based on a set of 5,815 orthologous proteins from four mammalian genomes (human, chimpanzee, mouse and rat) and one bird genome (chicken). We consider two classes of AAR: tandem repeats and cryptic repeats (regions of proteins containing overrepresentations of short amino acid repeats).

Results: Mammals show very similar repeat frequencies, but chicken shows lower frequencies of many of the cryptic repeats common in mammals. Regions flanking tandem AARs evolve more rapidly than the rest of the protein containing the repeat, and this phenomenon is more pronounced for non-conserved repeats than for conserved ones. GO associations are similar to those previously described for mammals, but chicken cryptic repeats show fewer significant associations. Comparing the overlaps of AARs with IURs and protein domains showed that up to 96% of some AAR types are associated preferentially with IURs. However, no more than 15% of IURs contained an AAR.

Conclusions: The location of AARs within IURs explains many of their evolutionary properties. Further study is needed on the types of IURs containing AARs.
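The overlap comparison between tandem AARs and IURs can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, the minimum run length of 5 residues is an assumed threshold, and the disordered intervals are assumed to come from some external disorder predictor as half-open (start, end) coordinates.

```python
import re


def find_tandem_repeats(seq, min_len=5):
    """Return (start, end, residue) for each run of a single amino acid
    that is at least min_len residues long (end is exclusive).
    min_len=5 is an assumed threshold, not taken from the paper."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]


def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def fraction_in_regions(repeats, regions):
    """Fraction of repeats that overlap at least one region.
    regions is a list of half-open (start, end) intervals, e.g. predicted IURs."""
    if not repeats:
        return 0.0
    hits = sum(
        1
        for start, end, _ in repeats
        if any(intervals_overlap(start, end, r_start, r_end)
               for r_start, r_end in regions)
    )
    return hits / len(repeats)


# Toy sequence built from labelled pieces so the run lengths are explicit.
toy_seq = "MK" + "Q" * 6 + "AA" + "GT" + "S" * 5 + "LL" + "A" * 6 + "PK"
# Hypothetical disordered intervals for the toy sequence.
toy_iurs = [(0, 10), (18, 27)]

toy_repeats = find_tandem_repeats(toy_seq)
toy_fraction = fraction_in_regions(toy_repeats, toy_iurs)
```

In this toy case the Q and A runs fall inside the assumed disordered intervals and the S run does not, so two of three tandem repeats are counted as IUR-associated. Cryptic repeats would need a separate detector and are not covered by this sketch.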
