Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles

Enhancers are stretches of DNA (100-1000 bp) that play a major role in developmental gene expression, evolution and disease. It has recently been shown that in higher eukaryotes enhancers rarely work alone; instead, they collaborate by forming clusters of cis-regulatory modules (CRMs). Although the binding of transcription factors is sequence-specific, identifying functionally similar enhancers is very difficult and cannot be carried out with traditional alignment-based techniques. In this paper we study the use of alignment-free measures for the classification of CRMs. However, alignment-free measures are generally tied to a fixed resolution k, the length of the words that are counted. Here we propose an alignment-free statistic based on multiple-resolution patterns derived from Entropic Profiles. An Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. We evaluate several alignment-free statistics on simulated data and on real mouse ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods.
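
To make concrete what "tied to a fixed resolution k" means, the sketch below computes the classic D2 statistic, i.e. the inner product of the k-mer count vectors of two sequences. It is only an illustration of a fixed-resolution baseline, not the multiple-resolution statistic proposed here; the function names `kmer_counts` and `d2` are hypothetical, and the sketch assumes plain uppercase DNA strings with no normalization or background model.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """D2 statistic: sum over all k-mers w of count_a(w) * count_b(w).

    Illustrative fixed-resolution measure only; k must be chosen in
    advance, and the score can change substantially with that choice.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for missing keys, so words absent from
    # seq_b contribute nothing to the sum.
    return sum(n * counts_b[w] for w, n in counts_a.items())


if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGT"
    for k in (1, 2, 4):
        print(k, d2(a, b, k))
```

Running the same pair of sequences at several values of k shows how the score depends on the chosen resolution, which is the limitation that motivates combining patterns of multiple lengths.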
