Discovering possible context dependences around SNP sites in human genes with Bayesian network learning

Single nucleotide polymorphisms (SNPs) are loci on the genome where different alleles are observed in the population. Earlier observations suggest that the sequence segments adjacent to SNP sites may contain patterns or context dependences. Discovering such dependences is important for understanding the possible evolutionary origins of SNPs. We collected 519,767 bi-allelic human SNPs in gene regions from HGBASE and separated them into six groups according to the types of alleles at the SNP loci. A Bayesian network structure learning technique was applied to discover possible dependences in the sequence segments around these sites, as well as in reference sequences collected for comparison. Noticeable probabilistic correlations among some loci were detected in all six SNP groups, whereas nothing significant was found in the reference sequences. The dependence relations differ among the SNP groups. These putative context dependences around SNP sites provide important hints for further analysis of SNP-related sequence patterns. The work also illustrates the power of the Bayesian network method as a tool for biological sequence analysis.
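
To make the two steps above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the six groups correspond to the six unordered allele pairs over A, C, G and T, and it uses a BIC-style score to decide whether a dependence between two positions in a set of aligned segments is worth adding as an edge. The abstract does not name a scoring function or search procedure, so BIC and all names here (`snp_group`, `mutual_information`, `bic_edge_gain`) are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

BASES = "ACGT"

# Assumption: the six SNP groups are the six unordered pairs of bases.
ALL_GROUPS = [f"{a}/{b}" for a, b in combinations(BASES, 2)]


def snp_group(allele_a, allele_b):
    """Return the unordered allele pair of a bi-allelic SNP, e.g. 'A/G'."""
    a, b = sorted((allele_a.upper(), allele_b.upper()))
    if a == b or a not in BASES or b not in BASES:
        raise ValueError("expected two distinct bases from ACGT")
    return f"{a}/{b}"


def mutual_information(seqs, i, j):
    """Empirical mutual information (in nats) between positions i and j."""
    n = len(seqs)
    pair = Counter((s[i], s[j]) for s in seqs)
    ci = Counter(s[i] for s in seqs)
    cj = Counter(s[j] for s in seqs)
    return sum(
        (c / n) * math.log(c * n / (ci[x] * cj[y]))
        for (x, y), c in pair.items()
    )


def bic_edge_gain(seqs, i, j, alphabet_size=4):
    """Change in BIC score when position j gains position i as a parent.

    The log-likelihood gain equals n times the empirical mutual information;
    the penalty covers the (alphabet_size - 1)**2 extra free parameters.
    A positive value favours adding the edge.
    """
    n = len(seqs)
    extra_params = (alphabet_size - 1) ** 2
    return n * mutual_information(seqs, i, j) - 0.5 * math.log(n) * extra_params


# Toy aligned segments (hypothetical data, for illustration only).
dependent = ["AA", "CC", "GG", "TT"] * 50                 # position 1 copies position 0
independent = [a + b for a in BASES for b in BASES] * 10  # every pair equally frequent
```

On the toy data, `bic_edge_gain(dependent, 0, 1)` is positive and `bic_edge_gain(independent, 0, 1)` is negative, which mirrors the contrast reported between the SNP groups and the reference sequences. A full structure learner would combine such local scores with a search over network structures that keeps the graph acyclic.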