Fast Entropic Profiler: An Information Theoretic Approach for the Discovery of Patterns in Genomes

Information theory has long been used in computational biology. In this paper we present a pattern discovery method, named Fast Entropic Profiler (FastEP), that is based on a local entropy function capturing the importance of a region with respect to the whole genome. The local entropy function was introduced by Vinga and Almeida; here we discuss and improve the original formulation. We provide a linear-time, linear-space algorithm, as opposed to the original quadratic implementation. Moreover, we propose an alternative normalization that can also be implemented efficiently. We show that FastEP is suitable for large genomes and for the discovery of patterns of unbounded length. FastEP is available at http://www.dei.unipd.it/~ciompin/main/FastEP.html.
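To make the idea of a local entropy function concrete, the following is a minimal Python sketch of a naive entropic profile. It is not the FastEP algorithm: it counts substrings directly instead of using a linear-time index such as a suffix tree. The formula used, f(i) = (1 + (1/n) * sum_{k=1..L} (4*phi)^k * c_k(i)) / sum_{k=0..L} phi^k, where c_k(i) is the number of occurrences in the whole sequence of the length-k word ending at position i, is our recollection of the Vinga and Almeida formulation and is an assumption, since the abstract does not state it. The function name and the parameters `L`, `phi` and `alphabet_size` are illustrative.

```python
from collections import Counter


def entropic_profile(seq, L, phi, alphabet_size=4):
    """Naive local-entropy score for every position of seq (assumed formula)."""
    n = len(seq)
    # Count every substring of length 1..L by direct enumeration.
    # This is the costly step that an indexed approach would avoid.
    counts = Counter(
        seq[s:s + k] for k in range(1, L + 1) for s in range(n - k + 1)
    )
    # Normalizing denominator: sum of phi^k for k = 0..L.
    denom = sum(phi ** k for k in range(L + 1))
    profile = []
    for i in range(n):
        total = 0.0
        # Words ending at position i; near the start fewer lengths are available.
        for k in range(1, min(L, i + 1) + 1):
            word = seq[i - k + 1:i + 1]
            total += (alphabet_size * phi) ** k * counts[word]
        profile.append((1.0 + total / n) / denom)
    return profile


if __name__ == "__main__":
    # Positions inside the repeated run of A's score higher than the lone C.
    for pos, score in enumerate(entropic_profile("AAAAC", L=2, phi=1.0)):
        print(pos, round(score, 3))
```

Under these assumptions, positions that end frequently repeated words receive higher scores, which is the sense in which the function measures the importance of a region relative to the whole sequence.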
