A probabilistic approach to sequence assembly validation

Sequence assembly is essential for determining the complete sequence of long DNA molecules. However, sequence assembly programs often generate misassembled contigs, either by joining different repeat copies, which connects noncontiguous DNA regions (inverted or swapped), or by including many fragments from different repeat copies, which introduces errors into the consensus sequence (noisy regions). Sequence assemblies are usually validated experimentally. While this is the most reliable approach, it is time-consuming and labor-intensive. In this paper, we propose a probabilistic approach to identify possible misassembled regions in shotgun sequence assemblies. Using statistics from a set of randomly sampled patterns in the shotgun data, we propose a probability model that measures each fragment's contribution to misassembly. From the probability model, we compute the entropy at each base position in the contig assembly. Our approach correctly identified all misassembled regions in the assembly of the Mycoplasma genitalium genome from real shotgun sequence data. Furthermore, using this approach, we identified many putative misassembled regions in the assemblies of bacterial genomes that we are currently sequencing.
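
To make the idea of a per-position entropy concrete, the following is a minimal sketch and not the probability model proposed in this paper. It assumes a deliberately simplified setting in which the fragments covering a contig are already aligned as equal-length, gap-padded strings, and it computes the Shannon entropy of the observed base frequencies in each column. The function names, the gap symbol "-", and the threshold parameter are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of the bases observed in one alignment column.

    Gap symbols are ignored; an empty column has entropy 0.
    """
    counts = Counter(base for base in column if base != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [n / total for n in counts.values()]
    # H = sum_b p_b * log2(1 / p_b)
    return sum(p * math.log2(1 / p) for p in probs)


def contig_entropy(aligned_fragments):
    """Per-position entropy for equal-length, gap-padded fragment rows.

    Assumption: every row has the same length as the contig.
    """
    return [column_entropy(col) for col in zip(*aligned_fragments)]


def flag_positions(entropies, threshold):
    """Indices whose entropy exceeds a user-chosen (hypothetical) threshold."""
    return [i for i, h in enumerate(entropies) if h > threshold]


if __name__ == "__main__":
    # Toy alignment: columns 0 and 1 agree, columns 2 and 3 are mixed.
    fragments = ["ACGT", "ACGA", "ACTA", "AC-A"]
    entropies = contig_entropy(fragments)
    print(entropies)
    print(flag_positions(entropies, threshold=0.5))
```

In this simplified picture, columns where all fragments agree have zero entropy, while columns that mix bases from different sources score higher. A run of high-entropy positions would be a candidate noisy region, although the model described above derives its probabilities from sampled pattern statistics rather than from raw base counts.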