Inferring ancestry in mouse genomes using a hidden Markov model

Recombinant inbred mouse strains are a powerful tool for understanding the genetics of complex traits such as alcohol tolerance. In 1981, McClearn and Kakihana introduced the Inbred Long Sleep (ILS) and Inbred Short Sleep (ISS) mouse strains, which had been selectively bred from eight ancestor strains for their response to alcohol. These strains provide a rare framework for computational analysis because their lineages are well documented and six of the eight ancestor strains have been sequenced. We have recently sequenced the complete genomes of the ILS and ISS strains to identify all single-nucleotide polymorphisms (SNPs), positions where individual nucleotides in the ILS and ISS genomes differ from the reference mouse genome (mm10). Here, we leverage these data, together with fine-scale mouse recombination rates, to infer the ancestral origin of each segment of the ILS and ISS genomes using a hidden Markov model (HMM). Our model is a fully connected set of seven states: one for each of the six sequenced ancestors and a single state to capture the two unsequenced ancestors (the "unknown" state). Transitions between distinct states correspond to crossover events during breeding. We further inform the transitions by using fine-scale mouse recombination rates from Brunschwig et al. as priors on the probability of moving from one state to a different state (a recombination event) between any two consecutive observations. Because recombination rates are per-base and region-specific, this naturally incorporates the distance between SNPs into the transition probabilities. Each state emits an indicator variable specifying whether the detected SNP is consistent with that state's ancestor. Inconsistencies can arise from sequencing errors and de novo mutations. We use an Expectation-Maximization (EM) algorithm to estimate the parameters of the model.
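The distance-aware transitions and Viterbi decoding described above can be illustrated with a minimal Python sketch. It is not the model's actual implementation. The function names, the Poisson approximation for the crossover probability, uniform switching among the other states, and a single fixed error rate (in place of EM-estimated parameters) are all assumptions made for illustration.

```python
import math

def switch_probability(rate_per_bp, distance_bp):
    """Probability of at least one crossover between two SNPs.

    Assumption: crossovers follow a Poisson process with the given
    per-base rate, so longer gaps make a state change more likely.
    """
    r = 1.0 - math.exp(-rate_per_bp * distance_bp)
    # Clamp to keep the logarithms below finite.
    return min(max(r, 1e-12), 1.0 - 1e-12)

def viterbi(consistent, positions, rates, error_rate=0.01):
    """Most likely ancestor path.

    consistent[t][k] is True when SNP t is consistent with state k.
    rates[t - 1] is the per-base rate between SNP t - 1 and SNP t.
    """
    n_states = len(consistent[0])
    log_hit, log_miss = math.log(1.0 - error_rate), math.log(error_rate)

    def emit(flag):
        return log_hit if flag else log_miss

    # Uniform prior over ancestors at the first SNP.
    score = [-math.log(n_states) + emit(c) for c in consistent[0]]
    backpointers = []
    for t in range(1, len(consistent)):
        r = switch_probability(rates[t - 1], positions[t] - positions[t - 1])
        log_stay = math.log(1.0 - r)
        log_switch = math.log(r / (n_states - 1))
        new_score, pointers = [], []
        for k in range(n_states):
            candidates = [
                score[j] + (log_stay if j == k else log_switch)
                for j in range(n_states)
            ]
            best_prev = max(range(n_states), key=candidates.__getitem__)
            new_score.append(candidates[best_prev] + emit(consistent[t][k]))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy input: two ancestors, ten SNPs spaced 1 kb apart.
flags = [[True, False]] * 5 + [[False, True]] * 5
positions = [1000 * i for i in range(10)]
rates = [1e-8] * 9
path = viterbi(flags, positions, rates)
```

In this toy input the first five SNPs match only ancestor 0 and the last five match only ancestor 1, so a single crossover is cheaper than explaining five SNPs as errors. A lone inconsistent SNP, by contrast, is absorbed as an error rather than triggering two crossovers.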
We also identify regions within the ILS and ISS genomes that are identical by descent (IBD). Large segments of the ancestor genomes are sometimes identical and therefore indistinguishable. We sought to identify such regions within the HMM output, where a particular ancestor is chosen by the Viterbi algorithm on the basis of few or no informative positions. In these cases the ancestor of origin cannot be determined, so we reclassify the segment as IBD. Our model outputs the inferred ancestor or an IBD label for every segment of the ILS and ISS genomes. These regions were then verified by manual inspection using the Integrative Genomics Viewer (IGV). In addition, we examined the consistency of called insertions, deletions, and structural variants between the ILS and ISS strains and the inferred ancestor within each region.
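The IBD reclassification step can be sketched as follows in Python. This is a hedged illustration only. The definition of an informative position (one whose consistency flags differ between at least two states), the `min_informative` threshold, and the function name are assumptions, not details taken from the analysis described above.

```python
from itertools import groupby

def label_segments(path, consistent, min_informative=1):
    """Collapse a decoded state path into segments and flag IBD ones.

    path[t] is the decoded state at SNP t.
    consistent[t][k] is True when SNP t is consistent with state k.
    Returns (start, end, label) tuples with end exclusive; label is
    the state index, or "IBD" when too few SNPs are informative.
    """
    segments = []
    start = 0
    for state, run in groupby(path):
        end = start + len(list(run))
        # Assumption: a SNP is informative when it separates at
        # least two candidate ancestors from each other.
        informative = sum(
            1 for flags in consistent[start:end] if len(set(flags)) > 1
        )
        label = state if informative >= min_informative else "IBD"
        segments.append((start, end, label))
        start = end
    return segments

# Toy input: the second segment has no informative SNPs.
toy_path = [0, 0, 0, 1, 1]
toy_flags = [
    [True, False],
    [True, False],
    [True, True],
    [True, True],
    [True, True],
]
segments = label_segments(toy_path, toy_flags)
```

Here the first segment keeps its ancestor call because two of its SNPs separate the candidates, while the second is relabeled IBD because every SNP in it is equally consistent with both ancestors.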