Fast Fourier transform-based correlation of DNA sequences using complex plane encoding

The detection of similarities between DNA sequences can be accomplished using the signal-processing technique of cross-correlation. An early method used the fast Fourier transform (FFT) to correlate DNA sequences of any length in O(n log n) time. However, this method requires nine FFTs, runs no faster when one sequence is much shorter than the other, and measures only global similarity, so significant short local matches may be missed. We report that, through the use of alternative encodings of the DNA sequence in the complex plane, the number of FFTs performed can be traded off against (i) signal-to-noise ratio and (ii) a certain degree of filtering for local similarity via k-tuple correlation. In addition, when probe sequences are compared against much longer targets, the algorithm can be sped up by decomposing the target and performing multiple small FFTs in an overlap-save arrangement. Finally, decomposing the probe sequence as well further enhances the detection of local similarities. Given current advances in extremely fast hardware implementations of signal-processing operations, this approach may prove more practical than it has been to date.
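The idea of correlating complex-encoded sequences can be illustrated with a minimal sketch. It is not the authors' implementation. The encoding used here (A = 1, T = -1, G = i, C = -i) is an assumption chosen for illustration and may differ from the encodings studied in the paper. The names `ENCODING`, `fft`, and `correlate` are likewise hypothetical. Under this assumed encoding, a single correlation needs three FFTs (two forward, one inverse), and the real part of the result at each lag equals the number of matching bases minus the number of complementary pairings (A with T, G with C). The score is therefore noisier than a pure match count, which is one way to picture trading FFT count against signal-to-noise ratio.

```python
import cmath

# Assumed encoding (not taken from the source): the four bases sit on the
# unit circle, with complementary bases at opposite points.
ENCODING = {"A": 1 + 0j, "T": -1 + 0j, "G": 1j, "C": -1j}


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def correlate(probe, target):
    """Map each lag to (matches - complementary pairings) with probe[n] on target[n + lag]."""
    # Zero-pad to a power of two long enough that circular wrap-around cannot
    # mix positive and negative lags.
    size = 1
    while size < len(probe) + len(target):
        size *= 2
    p = [ENCODING[b] for b in probe] + [0j] * (size - len(probe))
    t = [ENCODING[b] for b in target] + [0j] * (size - len(target))
    # Correlation theorem: the spectrum of the correlation is T(f) * conj(P(f)).
    product = [tv * pv.conjugate() for tv, pv in zip(fft(t), fft(p))]
    corr = [v / size for v in fft(product, inverse=True)]
    # Negative lags are stored at the end of the circular result.
    lags = range(-(len(probe) - 1), len(target))
    return {lag: round(corr[lag % size].real) for lag in lags}


scores = correlate("GATTACA", "CCGGATTACAGG")
best_lag = max(scores, key=scores.get)
print(best_lag, scores[best_lag])  # prints "3 7"
```

In this toy run the probe occurs exactly once in the target, starting at offset 3, so the score at lag 3 equals the probe length. A library FFT would normally replace the pure-Python `fft` above. The overlap-save decomposition of a long target mentioned in the abstract is not shown.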
