Sequence alignment by cross-correlation.

Many recent advances in biology and medicine have resulted from DNA sequence alignment algorithms and technology. Traditional approaches to matching DNA sequences are based either on global alignment schemes or on heuristic schemes that approximate global alignment while offering higher computational efficiency. This report describes an approach that compares sequences using the mathematical operation of cross-correlation, which can be implemented efficiently with the fast Fourier transform (FFT). The algorithm is summarized and sample applications are given. These include aligning gene sequences within long stretches of genomic DNA, finding sequence similarity in distantly related organisms, demonstrating sequence similarity in the presence of massive (approximately 90%) random point mutations, comparing sequences related by internal rearrangements (tandem repeats) within a gene, and investigating fusion proteins. Application to RNA and protein sequence alignment is also discussed. The method is efficient, sensitive, and robust, and it can find sequence similarities where other alignment algorithms may perform poorly.
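
As a rough illustration of the idea, the sketch below computes, for every relative shift of two DNA strings, how many bases coincide, using FFT-based cross-correlation. It is not the implementation from this report: the abstract does not specify how bases are encoded, so the one-indicator-vector-per-base encoding, the function name `match_counts_by_shift`, and the use of NumPy are assumptions made only for illustration.

```python
import numpy as np

BASES = "ACGT"  # assumed alphabet; other symbols simply never match


def match_counts_by_shift(a: str, b: str) -> dict:
    """Return {shift: matches}, where shift s aligns b[j] with a[j + s].

    Hypothetical helper: each base is encoded as a 0/1 indicator vector
    (an assumption, not taken from the report), and the four per-base
    cross-correlations are summed to give match counts per shift.
    """
    n, m = len(a), len(b)
    size = n + m - 1                      # length of the linear correlation
    nfft = 1 << (size - 1).bit_length()   # zero-pad to avoid wrap-around
    total = np.zeros(nfft)
    for base in BASES:
        ia = np.array([c == base for c in a], dtype=float)
        ib = np.array([c == base for c in b], dtype=float)
        fa = np.fft.rfft(ia, nfft)
        fb = np.fft.rfft(ib, nfft)
        # Correlation theorem: corr(a, b) = IFFT(FFT(a) * conj(FFT(b)))
        total += np.fft.irfft(fa * np.conj(fb), nfft)
    counts = np.rint(total).astype(int)   # remove floating-point noise
    # Negative shifts are stored at the end of the circular result.
    return {s: int(counts[s % nfft]) for s in range(-(m - 1), n)}


a = "TTTTACGTACGT"
b = "ACGTACGT"
counts = match_counts_by_shift(a, b)
best_shift = max(counts, key=counts.get)
print(best_shift, counts[best_shift])
```

In this toy case the peak of the correlation identifies the offset at which `b` lies inside `a`. The cost is dominated by the FFTs, roughly O(N log N) for a combined length N, compared with O(nm) for a direct comparison of every shift.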
