Sequence Alignment in Bioinformatics

Over two billion US dollars have been budgeted for the Human Genome Project alone in the past twelve years, not to mention similar or related projects worldwide. These investments have produced enormous amounts of biological data, much of which is sequence information on biomolecules: for example, the specification of a protein or DNA molecule by listing each amino acid or nucleotide in sequential order. These sequence data, which presumably contain the "digital" information of life, are hard to decipher. Extracting useful and important information from such massive biological data has developed into a new branch of science, bioinformatics. One of the most important and widely used methods in bioinformatics research is "sequence alignment". The basic idea is to speed up the identification of the biological function of a newly sequenced biomolecule, say a protein, by comparing its sequence with those of existing molecules that have already been characterized and documented in a database.
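To make the comparison step concrete, the following is a minimal sketch of scoring a new sequence against database entries with a Smith-Waterman-style local alignment. The function names, the toy scoring values (match, mismatch, and a linear gap penalty), and the dictionary used as a "database" are illustrative assumptions and are not taken from the text; practical tools use substitution matrices and more elaborate gap costs.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman-style).

    The scoring values are illustrative assumptions, not a recommended scheme.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_database(query, database):
    """Rank database entries (name -> sequence) by score against the query."""
    scores = {name: local_alignment_score(query, seq)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database; real entries would be characterized proteins or DNAs.
    toy_db = {"entry_x": "ACGT", "entry_y": "TTTT"}
    for name, score in rank_database("ACGT", toy_db):
        print(name, score)
```

A high raw score alone does not show that two molecules are related. Whether a given score is statistically significant is a separate question that this sketch does not address.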
