Rapid and sensitive dot-matrix methods for genome analysis

MOTIVATION: Dot-matrix plots are widely used for similarity analysis of biological sequences, and many algorithms and software tools have been developed for this purpose. Although some of these tools have been reported to handle sequences of a few hundred kb, analysis of genome sequences longer than 10 Mb on a microcomputer is still impractical because of long execution times and memory requirements.

RESULTS: Two dot-matrix comparison methods have been developed for the analysis of large sequences. The methods first locate regions of similarity between two sequences using a fast word search algorithm, then perform an explicit comparison on those regions. Because the initial screening removes most random matches, computing time is substantially reduced. The methods produce high-quality dot-matrix plots with low background noise. Space requirements are linear, so the algorithms can be used to compare genome-size sequences. Computing speed may be affected by the highly repetitive sequence structures of eukaryotic genomes. A dot-matrix plot of the yeast genome (12 Mb), covering both strands, was generated in 80 s on a 1 GHz personal computer.
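The two-stage idea described above (word search first, explicit comparison second) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `word_hits` and `dot_plot`, the parameters `k`, `window`, and `min_matches`, and their default values are all assumptions chosen for illustration. The sketch handles the forward strand only.

```python
from collections import defaultdict


def word_hits(a, b, k):
    """Yield (i, j) pairs where a[i:i+k] == b[j:j+k].

    The k-mer index holds one entry per position of `a`,
    so its size grows linearly with len(a).
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            yield i, j


def dot_plot(a, b, k=8, window=20, min_matches=16):
    """Return the set of (i, j) dots that survive explicit comparison.

    Hypothetical parameters: k is the word length used for screening,
    window is the length of the explicitly compared stretch, and
    min_matches is the identity threshold within that window.
    """
    dots = set()
    for i, j in word_hits(a, b, k):
        # Skip hits whose window would run past either sequence end.
        if i + window > len(a) or j + window > len(b):
            continue
        # Explicit comparison: count identical positions in the window.
        matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
        if matches >= min_matches:
            dots.add((i, j))
    return dots
```

Only positions that share an exact word of length `k` are ever compared explicitly, which is why the screening step discards most random matches before any window scoring takes place. A highly repetitive input would inflate the per-word position lists, consistent with the remark that repeats may slow the computation.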
