Scaffolding large genomes using mate-pair sequencing and ABySS

The development of long-distance genome sequencing libraries, known as mate-pair or jumping libraries, allows the contigs of a de novo genome sequence assembly to be assembled into scaffolds, which specify the order and orientation of those contigs. For the de novo genome sequence assembly software ABySS, we have developed a series of heuristic algorithms, each of which identifies a small subgraph of the scaffold graph matching a particular topology and applies a transformation to that subgraph to simplify the scaffold graph. These algorithms eliminate ambiguities in the scaffold graph and identify contigs that may be assembled into a scaffold.

Background
• De novo genome assembly is the task of assembling short reads into a draft genome sequence
• Such an assembly is often fragmented due to gaps in sequencing and repetitive genome sequence
• Scaffolding is the task of ordering and orienting those assembled sequences, called contigs, using paired-end data
• A scaffold graph is a directed graph where each vertex represents a contig sequence and each edge represents a bundle of paired reads

Method
• Estimate distances between contigs using a maximum likelihood estimator
• Identify subgraphs that match a topology typical of a particular genomic feature, such as a repeat sequence
• Transform each matching subgraph to simplify the scaffold graph
• Assemble contigs into scaffolds

Algorithm
1. Estimate distances between contigs
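
The distance-estimation step can be illustrated with a minimal sketch. It is not ABySS's implementation: it assumes, for illustration only, that library fragment sizes follow a normal distribution with known mean and standard deviation, and it ignores the bias that arises because only fragments long enough to span the gap are observed. The names `log_likelihood`, `estimate_distance`, `spans`, `d_min` and `d_max` are hypothetical. Each value in `spans` is the part of a fragment that lies on the two contigs, so the fragment size implied by a gap of `d` is `span + d`.

```python
import math


def log_likelihood(d, spans, mean, sd):
    """Log-likelihood of gap distance d under an assumed normal fragment-size model."""
    return sum(
        -0.5 * ((x + d - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
        for x in spans
    )


def estimate_distance(spans, mean, sd, d_min=-100, d_max=10000):
    """Return the integer d in [d_min, d_max] with the highest likelihood (grid search)."""
    return max(
        range(d_min, d_max + 1),
        key=lambda d: log_likelihood(d, spans, mean, sd),
    )


# Five read pairs link two contigs; the library has an assumed mean
# fragment size of 4000 bp and standard deviation of 400 bp.
spans = [2800, 2900, 3000, 3100, 3200]
estimate = estimate_distance(spans, mean=4000, sd=400)
print(estimate)  # 1000
```

Under this simplified model the maximum coincides with the library mean minus the average span. The grid search is kept so that the normal density could be swapped for an empirical fragment-size distribution without changing the rest of the sketch.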