New powerful statistics for alignment-free sequence comparison under a pattern transfer model.

Alignment-free sequence comparison is widely used to compare gene regulatory regions and to identify horizontally transferred genes. Recent studies of the widely used alignment-free statistic D2 and its variants D*2 and D(s)2 showed that, under a pattern transfer model, their power approaches a limit smaller than 1 as the sequence length tends to infinity. We develop new alignment-free statistics based on D2, D*2 and D(s)2 by comparing local sequence pairs and then summing over all local sequence pairs of a given length. We show that the new statistics are much more powerful than the corresponding original statistics and that their power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
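
As a rough illustration of the idea, the sketch below computes the plain D2 statistic (the number of k-word matches between two sequences) and then a windowed sum of D2 over all pairs of local subsequences. The abstract does not give the exact form of the new statistics, so the function names, the `window` parameter, and the choice to sum unnormalized D2 values are assumptions made for illustration only; the published statistics may center, normalize, or otherwise transform the local terms.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """Plain D2: sum over k-words w of count_x(w) * count_y(w)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for words absent from y, so they add nothing.
    return sum(n * cy[w] for w, n in cx.items())


def local_d2_sum(x, y, k, window):
    """Hypothetical local statistic: D2 summed over all window pairs.

    Assumption: every length-`window` subsequence of x is compared with
    every length-`window` subsequence of y, and the raw D2 values are added.
    """
    total = 0
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            total += d2(x[i:i + window], y[j:j + window], k)
    return total


if __name__ == "__main__":
    a, b = "ACGTACGT", "TTACGTAA"
    print(d2(a, b, k=3))
    print(local_d2_sum(a, b, k=3, window=5))
```

When `window` equals the full sequence length there is only one window pair, so the local sum reduces to the ordinary D2. The double loop is quadratic in the number of windows and is meant only to make the definition concrete, not to be efficient.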
