Segmental duplications: what's missing, misassigned, and misassembled--and should we care?

For many people, the announcement of the release of the working draft sequence of the human genome was the climax of more than 15 years of planning and preparation (International Human Genome Sequencing Consortium 2001). Despite the controversy and sensationalism, it was an awesome achievement, culminating in the “genome party of the century”. There was much to celebrate. The majority of genes have been identified and mapped to their appropriate locations, and now await the ascription of phenotypic data. Among the public, however, there is the impression that the task is a fait accompli. In my case, several family members contacted me after the media blitz to inquire whether I was now out of a job. After all, the Human Genome Project is entering its projected two-year twilight. Indeed, this may be the appropriate time for sequencers and sequence-gazers alike to “jump ship” or at the very least to look beyond the next horizon. The genomic revolution will now launch the proteomics revolution with its promise of tailor-made therapies for the masses. Association studies using SNP data are expected to provide insight into the molecular etiology of complex genetic diseases (Chakravarti 2001). Comparative sequencing of the genomes of model organisms such as the mouse and the rat will be used to discover elements critical in the regulation of our own genes and provide an invaluable resource for future mutagenesis studies (Nadeau et al. 2001). As scientists, we of course know that much work remains to be done before the final declaration of a finished human genome. We all recognize that gaps remain in the project, and most of the community is committed to rolling up their sleeves and getting on with the final sequence and analysis.
Nevertheless, despite this commitment, there remains the impression that gap closure will be akin to “mopping up the dance floor after the band has gone home”; it will be an arduous task with little reward, done by a few people willing to don the overalls, put the trash where it belongs, and pick up the pieces. Currently, two types of gaps are recognized within the working draft sequence (Bork and Copley 2001). The first type consists of gaps contained within the sequence assembly of the ordered clones. These are trivial gaps, each no more than a few hundred bp in length. Most will be closed during the “topping-off” of sequence from existing projects. The second type consists of gaps between ordered clones and sequence contigs. These are larger in size and potentially more problematic in nature. Some of these will be easily closed by the identification and sequencing of bridging clones obtained from paired-end sequence data. Others represent genomic segments not present within existing clone libraries. Such regions were highlighted during the closure of chromosomes 21 and 22 (Dunham et al. 1999; Hattori et al. 2000) and are purportedly similarly recalcitrant to subcloning. Specialized technologies are required to close such gaps in the clone map. I would like to propose a third type of gap that may be underestimated at present: gaps associated with segmental duplications of nearly identical sequence. These gaps result from the underrepresentation and misassembly of duplicated sequences in the human genome. Such gaps are particularly onerous because their resolution requires that the duplicated nature of the segments first be recognized and that the suboptimal assembly then be untangled. As part of the International Human Genome Sequencing Consortium, we examined the distribution of nearly identical duplications (90–98% sequence identity and >1 kb in length) throughout the genome and the quality of sequence assembly within such exceptional regions (Bailey et al.
2001; International Human Genome Sequencing Consortium 2001). The analysis revealed that a modest fraction of the genome (∼5%) consists of large duplicated segments, often containing complete or partial copies of genic material.

E-MAIL eee@po.cwru.edu; FAX (216) 368-3432. Article and publication are at www.genome.org/cgi/doi/10.1101/gr.188901.

The