Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions

The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences is missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
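
The abstract does not give the details of the breakpoint-based genotyping method, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that reads have already been mapped and classified as spanning either the insertion junction or the empty-site (reference) junction, and it picks the most likely insertion copy number under a simple binomial model. The function names, the `error` parameter and its default value are hypothetical.

```python
from math import log


def genotype_log_likelihoods(ins_reads, ref_reads, error=0.01):
    """Log-likelihood of the read counts for 0, 1 and 2 insertion copies.

    ins_reads: reads spanning the insertion breakpoint (assumed input)
    ref_reads: reads spanning the empty-site junction (assumed input)
    error:     assumed probability that a read supports the wrong allele
    """
    # Expected fraction of junction-spanning reads that support the insertion
    # for each diploid genotype (illustrative values, not from the paper).
    p_by_copies = {0: error, 1: 0.5, 2: 1.0 - error}
    return {
        copies: ins_reads * log(p) + ref_reads * log(1.0 - p)
        for copies, p in p_by_copies.items()
    }


def call_genotype(ins_reads, ref_reads, error=0.01):
    """Return the most likely insertion copy number, or None without reads."""
    if ins_reads + ref_reads == 0:
        return None  # no breakpoint-spanning reads, so no call is made
    likelihoods = genotype_log_likelihoods(ins_reads, ref_reads, error)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Toy read counts for three hypothetical samples at one locus.
    for sample, counts in {"A": (0, 12), "B": (6, 7), "C": (15, 0)}.items():
        print(sample, call_genotype(*counts))
```

A practical pipeline would also need to account for mapping quality, uneven coverage and loci with more than two copies, which this sketch ignores.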
