Evolutionary insights from suffix array-based genome sequence analysis

Gene and protein sequence analyses, central components of studies in modern biology, are readily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited to sequence analysis as well. Here we report an improvement to the construction of suffix arrays. The enhancement in versatility and scalability enabled by this approach is demonstrated through real-life examples. The scalability of the algorithm to whole genomes renders it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bigrams and higher n-grams, which indicates that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space; the analysis indicates that as much as a quarter of the total space at the tetrapeptide level is left unsampled in prokaryotic organisms, although almost all tripeptides can be seen in one protein or another in a proteome. In addition, distinct patterns begin to emerge in the counts of particular tetrapeptides and higher peptides, indicative of a ‘meaning’ for tetra- and higher n-grams. The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, encoded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
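
To make the n-gram and peptide-space analyses concrete, the following is a minimal sketch in Python. It is not the construction algorithm reported here: it assumes a naive sort-based suffix array, which is adequate only for short sequences, and the function names (`build_suffix_array`, `ngram_counts`, `peptide_space_coverage`) are hypothetical. It illustrates how n-gram counts can be read off a suffix array and how the fraction of the possible peptide space sampled by a sequence might be estimated, assuming the standard 20-letter amino acid alphabet.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed standard 20-letter alphabet


def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text.

    Illustrative only; whole genomes need a far more scalable algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def ngram_counts(text, suffix_array, n):
    """Count every n-gram by walking the suffixes in sorted order.

    Suffixes that share the same first n characters are adjacent in the
    suffix array, so each n-gram forms one contiguous run.
    """
    counts = {}
    for start in suffix_array:
        gram = text[start:start + n]
        if len(gram) < n:
            continue  # suffix too short to contain a full n-gram
        counts[gram] = counts.get(gram, 0) + 1
    return counts


def peptide_space_coverage(protein_text, n):
    """Fraction of the 20**n possible n-peptides observed in the text.

    N-grams containing characters outside the assumed alphabet (for
    example a separator between concatenated proteins) are ignored.
    """
    sa = build_suffix_array(protein_text)
    counts = ngram_counts(protein_text, sa, n)
    observed = [g for g in counts if all(c in AMINO_ACIDS for c in g)]
    return len(observed) / len(AMINO_ACIDS) ** n
```

For instance, `peptide_space_coverage(proteome, 4)` would return the sampled fraction of the 160,000 possible tetrapeptides, where `proteome` is a hypothetical string of concatenated protein sequences joined by a separator character outside the alphabet, such as `$`.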
