Paper information - Title: Shannon's Uncertainty and Kullback-Leibler Divergence in Microbial Genome and Metagenome Sequences

All genome sequence data contain inherent information. Shannon's uncertainty theory can be used to measure how much information a sequence carries. Here we show that the amount of information in sequences from metagenomes correlates with the number of similar sequences found by comparison to databases of known sequences. Hence, a sequence with more information (higher uncertainty) has a higher probability of being significantly similar to other sequences in the database. Measuring uncertainty may be a rapid way to screen for sequences likely to be similar to entries in the database, to prioritize the assignment of computational resources, and to show which sequences with no known similarities are likely to be false negatives. To predict which sequences could be coding based purely on their information content, we compared the uncertainty of intergenic and protein-coding regions for complete bacterial genomes. Intergenic regions were more likely to have higher uncertainty, but uncertainty was not predictive of the coding potential of short sequences. Because uncertainty could predict useful short sequences, long sequences could be divided into small fragments (100 bp), the uncertainty of each fragment measured, and the uncertainty of consecutive fragments compared to predict the useful portion of a sequence. Amino acid content in the genome may reflect lifestyle restrictions of an organism, and may also be predictive of coding potential. To compare the amino acid composition of the complete bacterial genome sequences, we calculated the Kullback-Leibler divergence of each genome from the mean amino acid content.
We demonstrate that (i) there is a significant difference in amino acid utilization between different phylogenetic groups of bacteria; (ii) the bacteria with the most skewed amino acid utilization profiles are endosymbionts or intracellular pathogens; (iii) the skews are not restricted to one or a few metabolic processes but occur across all subsystems; and (iv) amino acid utilization profiles correlate strongly with GC percent.
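The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k-mer word size `k`, the base-2 logarithm, the pseudocount, the direction of the divergence (genome relative to mean), and all function names are assumptions made for illustration. Only the 100 bp fragment size and the comparison against the mean amino acid content come from the text.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def shannon_uncertainty(seq, k=1):
    """Shannon uncertainty H = -sum(p * log2(p)) over k-mer frequencies, in bits.

    The word size k is an assumption; the abstract does not state it.
    """
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_uncertainty(seq, window=100, k=1):
    """Uncertainty of consecutive non-overlapping fragments (100 bp in the text)."""
    return [
        shannon_uncertainty(seq[i:i + window], k)
        for i in range(0, len(seq) - window + 1, window)
    ]


def amino_acid_frequencies(protein, pseudocount=1.0):
    """Amino acid frequency profile; the pseudocount (an assumption) avoids zeros."""
    counts = Counter(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def mean_distribution(profiles):
    """Per-residue mean of several frequency profiles."""
    return {aa: sum(p[aa] for p in profiles) / len(profiles) for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum(p * log2(p / q)), in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in p if p[aa] > 0)
```

For example, each genome's profile could be scored as `kl_divergence(profile, mean_distribution(all_profiles))`, so that larger values mark more skewed amino acid utilization, while `windowed_uncertainty` gives the per-fragment values that would be compared along a long sequence.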