Genome assembly comparison identifies structural variants in the human genome

Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs and intermediate-sized variants (ISVs). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.
