A novel sensitive method for the detection of user-defined compositional bias in biological sequences

MOTIVATION: Most biological sequences contain compositionally biased segments in which one or more residue types are significantly overrepresented. The function and evolution of these segments are poorly understood. Usually, all types of compositionally biased segments are masked and ignored during sequence analysis. However, it has been shown for a number of proteins that biased segments containing amino acids with similar chemical properties are involved in a variety of molecular functions and human diseases. A detailed large-scale analysis of the functional implications and evolutionary conservation of different compositionally biased segments requires a sensitive method capable of detecting user-specified types of compositional bias.

RESULTS: We present BIAS, a novel sensitive method for the detection of compositionally biased segments composed of a user-specified set of residue types. BIAS uses discrete scan statistics, which provide a highly accurate correction for multiple tests, to compute analytical estimates of the significance of each compositionally biased segment. The method can take global compositional bias into account when computing analytical estimates of the significance of local clusters. BIAS is benchmarked against the SEG, SAPS and CAST programs. We also use BIAS to show that groups of proteins with the same biological function are significantly associated with particular types of compositionally biased segments.
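To make the idea of a scan statistic concrete, the following is a minimal Python sketch and not the BIAS implementation. It slides a fixed-width window along a sequence, counts residues from a user-specified set, and takes the maximum count as the scan statistic. The abstract does not give the analytical significance estimates that BIAS uses, so the sketch substitutes a simpler and more conservative Bonferroni bound on a binomial tail. The function names, the fixed window width, and the choice of using the sequence's own residue frequency as the background probability (a rough stand-in for accounting for global compositional bias) are all assumptions made for illustration.

```python
from math import comb


def scan_statistic(seq, residues, window):
    """Return (max_count, start) for the window richest in `residues`.

    Hypothetical helper: a plain fixed-width sliding window.
    """
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    hits = [1 if c in residues else 0 for c in seq]
    count = sum(hits[:window])
    best, best_start = count, 0
    for i in range(window, len(hits)):
        # Slide the window one position to the right.
        count += hits[i] - hits[i - window]
        if count > best:
            best, best_start = count, i - window + 1
    return best, best_start


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def bonferroni_p_value(seq, residues, window, background=None):
    """Conservative upper bound on the significance of the richest window.

    Assumption: residues are independent with probability `background`;
    if not given, the sequence-wide frequency is used. This Bonferroni
    bound replaces the analytical scan-statistic estimate, which the
    abstract does not specify.
    """
    if background is None:
        background = sum(c in residues for c in seq) / len(seq)
    k, start = scan_statistic(seq, residues, window)
    n_windows = len(seq) - window + 1
    p_value = min(1.0, n_windows * binomial_tail(k, window, background))
    return p_value, k, start


if __name__ == "__main__":
    # Toy sequence with a Q/N-rich stretch embedded in unbiased flanks.
    toy = "ACDEFGHIKL" * 3 + "QQQQQQNQQQ" + "ACDEFGHIKL" * 3
    print(bonferroni_p_value(toy, {"Q", "N"}, window=10))
```

Multiplying the single-window tail probability by the number of windows overcorrects because overlapping windows are strongly dependent, which is the kind of inaccuracy that a dedicated scan-statistic approximation is meant to avoid.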
