Allowing mismatches in anchors for whole genome alignment: Generation and effectiveness

Based on experiments on 35 pairs of virus genomes using three software tools (MUMmer-3, MaxMinCluster, MSS), we show that using anchors with mismatches does increase the effectiveness of locating conserved regions: about 10% more conserved gene regions are located, while a high sensitivity is maintained. Generating a more comprehensive set of anchors with mismatches is not trivial for long sequences because of time and memory limitations. We propose two practical algorithms for generating this anchor set. One aims to speed up the process; the other aims to save memory. Experimental results show that both algorithms are faster (6 times and 5 times, respectively) than a straightforward suffix-tree-based approach.
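To make the notion of an anchor with mismatches concrete, the following is a minimal brute-force sketch, not either of the two proposed algorithms and not the suffix-tree baseline. It scans every diagonal of the comparison matrix and reports maximal substring pairs that differ in at most k positions. The function name `mismatch_anchors`, its parameters `min_len` and `k`, and the requirement that an anchor begin and end with a matching character are assumptions made for illustration only. Its quadratic running time is the kind of cost that becomes impractical for long sequences.

```python
def mismatch_anchors(a, b, min_len, k):
    """Return sorted (i, j, length) triples such that a[i:i+length] and
    b[j:j+length] differ in at most k positions, have length >= min_len,
    and start and end on a matching character.

    Illustrative brute force: O(len(a) * len(b)) time.
    """
    anchors = set()
    n, m = len(a), len(b)
    for d in range(-(n - 1), m):            # diagonal d means j - i == d
        i0, j0 = max(0, -d), max(0, d)      # first cell on this diagonal
        length = min(n - i0, m - j0)        # number of cells on the diagonal
        # Offsets along the diagonal where the characters differ,
        # with sentinels just before the start and just past the end.
        mm = [-1] + [p for p in range(length)
                     if a[i0 + p] != b[j0 + p]] + [length]
        # A window between mismatch t and mismatch t + k + 1 contains
        # exactly k mismatches and cannot be extended without exceeding k.
        windows = max(1, len(mm) - k - 1)
        for t in range(windows):
            start = mm[t] + 1
            end = mm[min(t + k + 1, len(mm) - 1)]   # exclusive
            # Trim mismatching characters from both ends.
            while start < end and a[i0 + start] != b[j0 + start]:
                start += 1
            while end > start and a[i0 + end - 1] != b[j0 + end - 1]:
                end -= 1
            if end - start >= min_len:
                anchors.add((i0 + start, j0 + start, end - start))
    return sorted(anchors)


if __name__ == "__main__":
    # The two toy sequences differ only at offset 3.
    print(mismatch_anchors("ACGTACGT", "ACGAACGT", min_len=8, k=1))
    print(mismatch_anchors("ACGTACGT", "ACGAACGT", min_len=8, k=0))
```

On this toy pair, allowing one mismatch yields the single full-length anchor (0, 0, 8), whereas requiring exact matches with the same length threshold yields none. This mirrors, on a very small scale, why admitting mismatches can recover conserved regions that exact-match anchors miss.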
