Genome assembly, rearrangement, and repeats.

Genomes evolve at different scales. At a small scale, a single nucleotide may be substituted, deleted, or inserted at a specific position in the genome. At the chromosomal scale, segments of genetic material may be acquired, removed, duplicated, and/or rearranged by various mechanisms. Eukaryotic genomes are usually much larger than prokaryotic genomes and often carry many repeats, that is, DNA sequences that appear multiple times in the genome as similar copies. A typical example is the human genome, in which repeats constitute more than half of the sequence. Genome rearrangements, which alter the chromosomal architecture during evolution, can be observed by comparing the order of genetic markers (e.g., genes) in two genomes that share a common ancestor. Each genome rearrangement event disrupts homologous segments in the two genomes and creates breakpoints between them. Repeats are frequently observed around breakpoint regions and are therefore hypothesized to be one of the driving forces of genome rearrangement. The richness of repeats in eukaryotes poses a great challenge for fragment assembly when sequencing these genomes. Although a few strategies have been proposed to address this issue, several kinds of misassemblies may still exist even in published, but not yet completely finished, genome sequences, especially those sequenced using the whole genome shotgun (WGS) approach. Some repeats may be missed and left as gaps. Some repeats may be collapsed, resulting in a smaller number of copies and an inaccurate sequence for each copy. Finally, assemblers may be confused by

† Corresponding author: Haixu Tang, School of Informatics, Indiana University, 901 E. 10th Street, Bloomington, IN 47408. Tel: 812-856-1859; fax: 812-856-1995; e-mail: hatang@indiana.edu. Dr. Haixu Tang has been an assistant professor in the School of Informatics, and an affiliated faculty member of the Center for Genomics and Bioinformatics (CGB), at Indiana University, Bloomington, since 2004.
Professor Tang received his Ph.D. in Molecular Biology from the Shanghai Institute of Biochemistry in 1998. Between 1999 and 2001, he was a postdoctoral associate in the Department of Mathematics at the University of Southern California; between 2001 and 2004, he was an assistant project scientist in the Department of Computer Science and Engineering, University of California, San Diego.

Chem. Rev. 2007, 107, 3391−3406