An integrated Asian human SNV and indel benchmark established using multiple sequencing methods

Sequencing technologies have developed rapidly in recent years, leading to breakthroughs in sequencing-based clinical diagnosis, but an accurate and complete genome variation benchmark is required for further assessment of precision medicine applications. Although the NA12878 human cell line has been successfully developed into a variation benchmark, population-specific variation benchmarks are still lacking. Here, we established an Asian human variation benchmark by constructing and sequencing a stabilized cell line from a Han Chinese volunteer. Using seven different sequencing strategies, we obtained ~3.88 Tb of clean data from different laboratories, aiming for high sequencing depth and accurate variation detection. By combining the variations identified by different sequencing strategies and different analysis pipelines, we identified 3.35 million SNVs and 348.65 thousand indels that were well supported by our sequencing data and passed our strict quality control, and thus should constitute a high-confidence variation benchmark. In addition, we detected 5,913 high-quality SNVs, 969 of which were novel, located in highly homologous regions and supported by long-range information in both the co-barcoding single-tube Long Fragment Read (stLFR) data and the PacBio HiFi CCS data. Furthermore, using the long-read data (stLFR and HiFi CCS), we were able to phase more than 99% of heterozygous SNVs, which raises the benchmark to the haplotype level. Our study provides comprehensive sequencing data as well as an integrated variation benchmark for an Asian-derived cell line, which should be valuable for future sequencing-based clinical development.
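
The idea of combining variations across sequencing strategies and pipelines can be illustrated with a minimal sketch. It is not the integration procedure used in this study, which the abstract does not specify. It assumes a simple support-count rule: a variant is kept when at least `min_support` call sets report it. The function name `consensus_calls`, the threshold, and the toy call sets are all hypothetical.

```python
from collections import Counter


def consensus_calls(callsets, min_support):
    """Return variants reported by at least `min_support` call sets.

    Each call set is an iterable of (chrom, pos, ref, alt) tuples.
    This support-count rule is an illustrative assumption, not the
    authors' integration method.
    """
    support = Counter()
    for calls in callsets:
        # De-duplicate within a call set so each set votes once per variant.
        support.update(set(calls))
    return {variant for variant, n in support.items() if n >= min_support}


# Toy call sets with made-up coordinates (hypothetical data).
short_reads = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}
linked_reads = {("chr1", 1000, "A", "G"), ("chr2", 500, "G", "GA")}
long_reads = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}

benchmark = consensus_calls([short_reads, linked_reads, long_reads], min_support=2)
print(sorted(benchmark))
```

In this toy example the indel on `chr2` is reported by only one call set and is therefore excluded. A real integration would also need to normalize variant representation and account for platform-specific error profiles before comparing calls.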
