MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions

deletion at this location, the distribution p(Ci) will shift (Fig. 1a). If the observed cluster is the site of a heterozygous indel, approximately half of the observed mate pairs will be generated from the shifted distribution, and the other half will come from the original, unshifted p(Y) (Fig. 1b). MoDIL models the expected size of the indel (the mean insert size minus the mapped distance) with two random variables, one for each haplotype. Given a cluster, MoDIL identifies the two distributions, {D1, D2}, with the fixed shape of p(Y) and arbitrary means, that best fit the observed data according to the Kolmogorov-Smirnov test. To find the means of the two distributions, MoDIL uses the expectation-maximization algorithm with appropriate Bayesian priors to prevent over-fitting. For each distribution Dk, k ∈ {1,2}, the size of the indel event can be estimated with high confidence: its expected size follows a