BLMT: statistical sequence analysis using N-grams.

Statistical analysis of amino acid and nucleotide sequences, especially sequence alignment, is one of the most commonly performed tasks in modern molecular biology. However, for many tasks in bioinformatics, the requirement that the features in an alignment be consecutive is restrictive, and "n-grams" (also known as k-tuples) have been used as features instead. N-grams are usually short nucleotide or amino acid sequences of length n, but the unit for a gram may be chosen arbitrarily. The n-gram concept is borrowed from language technologies, where n-grams of words form the fundamental units in statistical language models. Despite the demonstrated utility of n-gram statistics in the biology domain, there is currently no publicly accessible generic tool for the efficient calculation of such statistics. Most sequence analysis tools disregard matches to short sequences because such matches lack statistical significance. This article presents the integrated Biological Language Modeling Toolkit (BLMT), which allows efficient calculation of n-gram statistics for arbitrary sequence datasets.

Availability: BLMT can be downloaded from http://www.cs.cmu.edu/~blmt/source and installed for standalone use on any Unix platform or Unix shell emulation, such as Cygwin on the Windows platform. Specific tools and usage details are described in a "readme" file. The n-gram computations carried out by the BLMT are part of a broader set of tools borrowed from language technologies and modified for statistical analysis of biological sequences; these are available at http://flan.blm.cs.cmu.edu/.
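To make the n-gram idea concrete, the following is a minimal Python sketch of counting overlapping n-grams in a single sequence, with one character (nucleotide or amino acid) as the gram unit. It is an illustration only and does not reproduce BLMT's implementation or command-line interface; the function names `ngram_counts` and `ngram_frequencies` and the example sequence are hypothetical, and a plain hash-based counter is assumed here in place of whatever indexing BLMT uses internally.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count overlapping n-grams (k-tuples) of length n in a sequence.

    Hypothetical helper for illustration; not part of BLMT.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the sequence, one position at a time.
    # If the sequence is shorter than n, the range is empty and so is the result.
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_frequencies(sequence, n):
    """Relative frequency of each n-gram; empty dict if no n-gram fits."""
    counts = ngram_counts(sequence, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    # Made-up nucleotide sequence, used only as an example input.
    for gram, count in ngram_counts("ACGTACGT", 2).most_common():
        print(gram, count)
```

A sequence of length L yields L - n + 1 overlapping n-grams, so the counts for a given n always sum to that value. Holding every distinct n-gram in memory, as this sketch does, can become expensive for large n or for whole-genome inputs.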
