Evolutionary analysis of amino acid repeats across the genomes of 12 Drosophila species.

Repeated motifs of amino acids within proteins are an abundant feature of eukaryotic sequences and may catalyze the rapid production of genetic and even phenotypic variation among organisms. The completion of the genome sequencing projects of 12 distinct Drosophila species provides a unique dataset to study these intriguing sequence features on a phylogeny with a variety of timescales. We show that a higher percentage of proteins contain repeats in the Drosophila genus than in most other eukaryotes, including non-Drosophila insects, which makes this collection of species particularly useful for the study of protein repeats. We also find that proteins containing repeats are overrepresented in functional categories involving developmental processes, signaling, and gene regulation. Using the set of 1-to-1 ortholog alignments for the 12 Drosophila species, we test the ability of repeats to act as reliable phylogenetic signals and find that they resolve the generally accepted phylogeny despite the noise caused by their accelerated rate of evolution. We also determine that, in general, the position of repeats within a protein sequence is non-random, with repeats more often absent from the middle regions of sequences. Finally, we find evidence that the presence of repeats is associated with an increased evolutionary rate across the entire sequence in which they are embedded. Given additional evidence suggesting a corresponding elevation in positive selection, we propose that some repeats may induce compensatory substitutions in their surrounding sequence.
