Separating Significant Matches from Spurious Matches in DNA Sequences

Word matches are widely used to compare genomic sequences. Complete genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among the matches retrieved when two genomic sequences are compared, some may be spurious matches (SMs): matches that arise by chance rather than from homology. The number of SMs depends on the minimal match length (ℓ) set in the algorithm used to retrieve them. If ℓ is too small, many matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved and many shorter significant matches are missed. To date, ℓ has mostly been chosen from empirical threshold values rather than by robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of the lengths of the matches obtained from the comparison of two genomic sequences.
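
To make the idea of a mixture of geometric distributions concrete, the following Python sketch fits such a mixture to a list of match lengths with the EM algorithm. It is an illustration under stated assumptions, not the procedure of the paper: the abstract gives neither the number of components nor the estimation method, so the choice of two components, the shift of lengths by the minimal match length, the reading of the fast-decaying component as the spurious one, and all function names (`fit_geometric_mixture`, `spurious_posterior`, `simulate_lengths`) are hypothetical.

```python
import math
import random
from collections import Counter


def _responsibilities(x, weights, probs):
    """Posterior probability of each component for a shifted length x >= 0."""
    logs = [math.log(w) + math.log(p) + x * math.log1p(-p)
            for w, p in zip(weights, probs)]
    top = max(logs)  # log-sum-exp trick keeps long matches from underflowing
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def _clamp(value, low=1e-12, high=1.0 - 1e-12):
    return min(high, max(low, value))


def fit_geometric_mixture(lengths, min_len, n_iter=500):
    """EM fit of a two-component geometric mixture (an assumed model size).

    Each length L is modelled through x = L - min_len, with
    P(x | component k) = p_k * (1 - p_k) ** x.
    Returns (weights, probs), ordered so that component 0 decays fastest.
    """
    counts = Counter(length - min_len for length in lengths)
    n = sum(counts.values())
    mean_x = sum(x * c for x, c in counts.items()) / n
    weights = [0.5, 0.5]
    # Start on either side of the single-geometric estimate 1 / (mean + 1).
    probs = [min(0.95, 2.0 / (mean_x + 1.0)), 0.5 / (mean_x + 1.0)]
    for _ in range(n_iter):
        resp_sum = [0.0, 0.0]
        resp_x_sum = [0.0, 0.0]
        for x, c in counts.items():  # E-step, one pass per distinct length
            resp = _responsibilities(x, weights, probs)
            for k in range(2):
                resp_sum[k] += c * resp[k]
                resp_x_sum[k] += c * resp[k] * x
        # M-step: closed-form updates for a geometric mixture.
        weights = [_clamp(s / n) for s in resp_sum]
        probs = [_clamp(resp_sum[k] / max(resp_sum[k] + resp_x_sum[k], 1e-300))
                 for k in range(2)]
    order = sorted(range(2), key=lambda k: probs[k], reverse=True)
    return [weights[k] for k in order], [probs[k] for k in order]


def spurious_posterior(length, min_len, weights, probs):
    """Probability that a match of this length comes from component 0.

    Treating component 0 (the shorter one) as spurious is an assumption.
    """
    return _responsibilities(length - min_len, weights, probs)[0]


def simulate_lengths(n_short, p_short, n_long, p_long, min_len, rng):
    """Synthetic match lengths drawn from two geometric components."""
    def draw(p):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        return min_len + int(math.log(u) / math.log(1.0 - p))
    return ([draw(p_short) for _ in range(n_short)]
            + [draw(p_long) for _ in range(n_long)])


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = simulate_lengths(8000, 0.5, 2000, 0.02, 10, rng)
    weights, probs = fit_geometric_mixture(lengths, min_len=10)
    print("weights:", weights)
    print("probs:", probs)
    print("P(spurious | length 15):",
          spurious_posterior(15, 10, weights, probs))
```

With a fit like this, one possible way to choose ℓ is to take the smallest length at which the posterior probability of the spurious component falls below a tolerance of one's choosing; the tolerance itself would still be a user decision rather than something the abstract specifies.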
