Algorithms for whole genome shotgun sequencing

A monumental achievement in the history of science, the sequencing of the entire human genome, will soon be reached. The Human Genome Project (HGP) has been working toward this goal since 1990 using a two-tiered strategy. Recently it was proposed that a whole-genome shotgun approach would sequence the genome faster and at lower cost. This thesis expands on that proposal by presenting two algorithms for whole-genome shotgun sequencing, both of which were implemented and tested on simulated data.

Essential to this approach is the availability of pairs of short, unique sequence markers at a roughly estimated distance from each other. Determining the sequence of the genome can then be broken into a series of inter-marker assembly problems, each of which determines the sequence between a pair of markers. Unfortunately, marker pairs are not always correct, and repeats can greatly confound the assembly. This motivates the first problem: rapidly finding a set of linked contigs, called a scaffold, between a pair of markers, which confirms both the marker pair and the ability to traverse the region between them. We then present an inter-marker assembly algorithm that determines the unique sequence segments between a marker pair. Both algorithms are evaluated in a simulation that can model various types of repeats and in which the only information about the presence of repeats is excessive coverage and the ability to detect their boundaries. Simulation results show that at 10x coverage one can find and assemble the unique sequence between markers more than 99.9% of the time for many of the repeat models.

Events in this field have been moving rapidly. Recently a new company, Celera Genomics, announced its intention to sequence the human genome before the HGP by using the whole-genome shotgun approach. We end this thesis by briefly discussing Celera's approach and relating it to the algorithms presented here.
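As an illustration of the "excessive coverage" signal mentioned above, the following minimal Python sketch computes per-base read depth and flags stretches whose depth exceeds a multiple of the expected coverage. It is not the thesis's algorithm. The function names, the half-open `(start, end)` read placements, and the `factor` threshold are all assumptions made for this example, and it presumes read placements are already known (with `0 <= start < end <= length`), which is a simplification of the real assembly setting.

```python
from typing import List, Tuple


def depth_profile(reads: List[Tuple[int, int]], length: int) -> List[int]:
    """Per-base depth for reads given as half-open (start, end) intervals."""
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for i in range(length):
        running += diff[i]
        depth.append(running)
    return depth


def excess_coverage_regions(
    depth: List[int], expected: float, factor: float = 2.0
) -> List[Tuple[int, int]]:
    """Half-open regions where depth > expected * factor.

    `factor` is a hypothetical cutoff chosen for illustration only.
    """
    threshold = expected * factor
    regions = []
    start = None
    for i, d in enumerate(depth):
        if d > threshold:
            if start is None:
                start = i  # open a new over-covered region
        elif start is not None:
            regions.append((start, i))  # close the current region
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Toy data: two reads tile 20 bases at depth 1; three extra reads pile up
# on positions 5..7, mimicking reads from other repeat copies.
reads = [(0, 10), (10, 20), (5, 8), (5, 8), (5, 8)]
depth = depth_profile(reads, 20)
print(excess_coverage_regions(depth, expected=1.0))
```

In this toy input the three stacked reads raise the depth over positions 5 to 7 above the cutoff, so that stretch is reported as a candidate repeat region; its endpoints play the role of the detectable repeat boundaries.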