Pseudo-periodic partitions of biological sequences

MOTIVATION: Algorithms for finding typical patterns in sequences, especially multiple pseudo-repeats (pseudo-periodic regions), are at the core of many problems in biological sequence and structure analysis. One of the most significant features of biological sequences is their high quasi-repetitiveness, and variation in the quasi-repetitiveness of genomic and proteomic texts reflects the presence and density of different biologically important information. Sensitive automatic computational methods for identifying pseudo-periodic regions of sequences are therefore needed to infer, describe and understand biological properties, and to seek precise molecular details of biological structures, dynamics, interactions and evolution.

RESULTS: We develop a novel computational tool for partitioning a sequence into pseudo-periodic regions. A pseudo-periodic partition is defined as a partition that, intuitively, has the minimal bias, measured by evolutionary distance, relative to some perfect-periodic partition of the sequence. We devise a quadratic time and space algorithm for detecting a pseudo-periodic partition of a given sequence; the partition corresponds to the shortest path in the main diagonal of the directed acyclic weighted graph constructed by the Smith-Waterman self-alignment of the sequence. We use several typical examples to demonstrate the use of our algorithm and software system in detecting functional or structural domains and regions of proteins. An advantage of our software is that it has an associated parameter, the granularity factor, and any biological sequence family can be chosen as a training set to determine the best value of this parameter. In general, we choose all repeats (including many pseudo-repeats) in the SWISS-PROT amino acid sequence database as a typical training set. We show that, for this training set, the granularity factor is 0.52 and that the average agreement accuracy of the pseudo-periodic partitions detected by our software for all pseudo-repeats in the SWISS-PROT database is as high as 97.6%.
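To make the notion of a Smith-Waterman self-alignment concrete, the following minimal Python sketch fills the local-alignment score matrix of a sequence against itself, restricted to the cells above the main diagonal so that the trivial alignment of the sequence with itself is excluded. It is an illustration only and not the algorithm of this paper: the function name `self_alignment`, the match/mismatch/gap scores, and the use of the best-scoring cell's diagonal offset as a rough period estimate are all assumptions, and the shortest-path step and the granularity factor are not modeled.

```python
def self_alignment(seq, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment of seq against itself (upper triangle only).

    Returns (H, best), where H[i][j] is the best local alignment score ending
    at seq[i-1] paired with seq[j-1] for j > i, and best = (score, i, j) is the
    first highest-scoring cell. Scores are illustrative assumptions.
    """
    n = len(seq)
    # Cells on or below the main diagonal stay 0, which blocks the trivial
    # alignment of every residue with itself.
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # extend along the diagonal
                H[i - 1][j] + gap,    # gap in the second copy
                H[i][j - 1] + gap,    # gap in the first copy
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return H, best


if __name__ == "__main__":
    _, (score, i, j) = self_alignment("ABCABCABC")
    # The diagonal offset j - i of the best cell hints at a repeat period.
    print(score, j - i)
```

Both the time and the memory used by this sketch grow quadratically with the sequence length, which matches the complexity class stated above, although the constants and the actual graph construction in the paper may differ.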
