Statistical tests to compare motif count exceptionalities

Background

Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Computing p-values for motif counts identifies exceptional motifs, which are candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Simply comparing the motif count p-values obtained in each sequence is not sufficient to decide whether the motif is significantly more exceptional in one sequence than in the other. A statistical test is required.

Results

We develop and analyze two statistical tests, an exact binomial test and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with special care taken for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops.

Conclusion

The exact binomial test is particularly well suited to small counts. For large counts, we advise using the likelihood ratio test, which is asymptotic but strongly correlated with the exact binomial test and very simple to use.
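The idea behind the two tests can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a non-overlapping motif, so that the two counts are plain independent Poisson variables, and it assumes the expected counts `e1` and `e2` under each sequence's background model are already known. Under these assumptions, and under the null hypothesis that the observed-to-expected ratio is the same in both sequences, the first count conditional on the total follows a binomial distribution with success probability `e1 / (e1 + e2)`. All function and parameter names are hypothetical.

```python
import math


def exact_binomial_pvalue(n1, n2, e1, e2):
    """Two-sided exact binomial p-value for equal exceptionality.

    n1, n2: observed motif counts in sequences 1 and 2.
    e1, e2: expected counts under each background model (assumed given).
    Intended for small counts; large totals may overflow float arithmetic.
    """
    n = n1 + n2
    p = e1 / (e1 + e2)  # null probability that an occurrence falls in sequence 1
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[n1]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


def likelihood_ratio_test(n1, n2, e1, e2):
    """Asymptotic likelihood ratio test of the same null hypothesis.

    Returns (statistic, p-value); the statistic is compared with a
    chi-square distribution with 1 degree of freedom.
    """
    n = n1 + n2
    p = e1 / (e1 + e2)

    def term(obs, exp):
        # Convention: 0 * log(0) = 0.
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    stat = 2.0 * (term(n1, n * p) + term(n2, n * (1 - p)))
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    pvalue = math.erfc(math.sqrt(stat / 2.0))
    return stat, pvalue
```

For example, `exact_binomial_pvalue(30, 10, 1.0, 1.0)` and `likelihood_ratio_test(30, 10, 1.0, 1.0)` both test whether a motif seen 30 times in one sequence and 10 times in another, with equal expected counts, has the same exceptionality in both. Handling overlapping motifs, as the paper does, would require a compound Poisson model that this sketch does not attempt.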
