Context-dependent individualization of nucleotides and virtual genomic hybridization allow the precise location of human SNPs

We have entered the era of individual genome sequencing, and progress in the field is already exponential. Excluding false-positive variants from reported datasets is of the utmost importance. However, because of the nature of the algorithms in use, this task has not been optimized to the required level of precision. This study presents a unique strategy for identifying SNPs, called COIN-VGH, that greatly reduces the number of false positives in the generated data. The algorithm was developed using the X-chromosome–specific regions from the previously sequenced genomes of Craig Venter and James Watson. It is based on the concept that a nucleotide can be individualized if it is analyzed in the context of its surrounding genomic sequence. COIN-VGH first defines the most comprehensive set of nucleotide strings of a defined length that map with 100% identity to a unique position within the human reference genome (HRG). This set is then used to retrieve sequence reads from a query genome (QG), producing a genomic landscape that represents a draft HRG-guided assembly of the QG. The landscape is analyzed for specific signatures that indicate the presence of SNPs. The fidelity of the variation signature was assessed in simulation experiments by virtually altering the HRG at defined positions. Finally, the signature regions identified in the HRG and in the QG reads are aligned, and the precise nature and position of the corresponding SNPs are determined. The advantages of COIN-VGH over previous algorithms are discussed.
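The idea described above can be illustrated with a toy sketch. This is not the COIN-VGH implementation: the function names (`unique_kmers`, `landscape`, `snp_candidates`), the string length `k = 4`, and the short sequences are all hypothetical, and the sketch assumes a single strand, error-free reads, and exact string matching. It builds the set of strings that occur exactly once in a reference, counts how often each one is found in the query reads, and flags a run of exactly `k` consecutive unmatched strings as the kind of signature a single substitution would leave. The final step of aligning the signature region to determine the variant base is not shown.

```python
from collections import Counter


def unique_kmers(reference: str, k: int) -> dict[str, int]:
    """Map every k-mer that occurs exactly once in the reference to its start."""
    starts = range(len(reference) - k + 1)
    counts = Counter(reference[i:i + k] for i in starts)
    return {reference[i:i + k]: i for i in starts
            if counts[reference[i:i + k]] == 1}


def landscape(reference: str, reads: list[str], k: int) -> list:
    """Per reference start position, count exact hits of its unique k-mer.

    Positions whose k-mer is not unique in the reference are left as None.
    """
    index = unique_kmers(reference, k)
    hits = [None] * (len(reference) - k + 1)
    for pos in index.values():
        hits[pos] = 0
    for read in reads:
        for j in range(len(read) - k + 1):
            pos = index.get(read[j:j + k])
            if pos is not None:
                hits[pos] += 1
    return hits


def snp_candidates(hits: list, k: int) -> list[int]:
    """Return reference positions implied by runs of exactly k zero-hit k-mers.

    A single substitution at position p removes the k-mers starting at
    p-k+1 .. p, so a zero run starting at s points to position s + k - 1.
    """
    candidates = []
    run_start = None
    for pos, h in enumerate(hits + [1]):  # sentinel closes a trailing run
        if h == 0:
            if run_start is None:
                run_start = pos
        else:
            if run_start is not None and pos - run_start == k:
                candidates.append(run_start + k - 1)
            run_start = None
    return candidates


# Toy data (hypothetical): the query differs from the reference at index 10.
K = 4
REFERENCE = "ACGTTGCAAGGCTAATCCGA"
QUERY = "ACGTTGCAAGTCTAATCCGA"
# Simulated reads: overlapping 8-base windows taken from the query.
READS = [QUERY[i:i + 8] for i in range(0, len(QUERY) - 7, 2)]

print(snp_candidates(landscape(REFERENCE, READS, K), K))  # [10]
```

Restricting the index to strings that are unique in the reference is what ties each retrieved read to a single reference position; in this toy, positions with repeated strings are simply left out of the landscape.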
