Generalization of Entropy Based Divergence Measures for Symbolic Sequence Analysis

Entropy-based measures have been used frequently in symbolic sequence analysis. The Jensen-Shannon divergence (JSD), a symmetrized and smoothed form of the Kullback-Leibler divergence (relative entropy), is of particular interest because it shares properties with families of other divergence measures and can be interpreted in several domains, including statistical physics, information theory and mathematical statistics. The uniqueness and versatility of this measure arise from a number of attributes, including its generalization to any number of probability distributions and the association of weights with those distributions. Furthermore, its entropic formulation allows it to be generalized in different statistical frameworks, such as non-extensive Tsallis statistics and higher-order Markovian statistics. We revisit these generalizations and propose a new generalization of JSD in an integrated Tsallis and Markovian statistical framework. We show that this generalization can be interpreted in terms of mutual information. We also investigate the performance of the different JSD generalizations in deconstructing chimeric DNA sequences assembled from bacterial genomes, including those of E. coli, S. enterica typhi, Y. pestis and H. influenzae. Our results show that the JSD generalizations yield more pronounced improvements when the sequences being compared are from phylogenetically proximal organisms, which are often difficult to distinguish because of their compositional similarity. Small but noticeable improvements were observed with the Tsallis statistical JSD generalization, and relatively large improvements were observed with the Markovian generalization. The proposed Tsallis-Markovian generalization yielded still more pronounced improvements relative to both the Tsallis and the Markovian generalizations, specifically when the sequences being compared arose from phylogenetically proximal organisms.
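The weighted, multi-distribution form of JSD mentioned above can be illustrated with a minimal sketch. It assumes the standard entropic definition, JSD = H(sum_i w_i p_i) - sum_i w_i H(p_i), with Shannon entropy H in bits; the function names `shannon_entropy` and `weighted_jsd` and the example frequencies are hypothetical and are not taken from the paper. The Tsallis and Markovian generalizations are not shown here.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability symbols contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def weighted_jsd(distributions, weights=None):
    """Entropy of the weighted mixture minus the weighted mean of entropies.

    distributions: list of probability vectors over the same alphabet.
    weights: non-negative weights summing to 1 (uniform if omitted).
    """
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n
    k = len(distributions[0])
    # Mixture distribution: weighted average of the inputs, symbol by symbol.
    mixture = [sum(w * p[i] for w, p in zip(weights, distributions))
               for i in range(k)]
    mean_entropy = sum(w * shannon_entropy(p)
                       for w, p in zip(weights, distributions))
    return shannon_entropy(mixture) - mean_entropy

# Illustrative (made-up) nucleotide frequencies for two sequence segments,
# ordered A, C, G, T, weighted by assumed relative segment lengths.
left = [0.30, 0.20, 0.20, 0.30]
right = [0.20, 0.30, 0.30, 0.20]
divergence = weighted_jsd([left, right], weights=[0.6, 0.4])
```

In segmentation-style applications the weights are commonly taken proportional to segment lengths, which is the assumption made in the usage lines above.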
