stLFRsv: a germline SV analysis pipeline using co-barcoded reads

Co-barcoded reads originate from long DNA fragments (mean length greater than 50 kbp) tagged with barcodes, and they retain both single-base accuracy and long-range genomic information. We propose stLFRsv, a pipeline that detects structural variation using co-barcoded reads. stLFRsv identifies abnormally large gaps between co-barcoded reads to detect potential breakpoints and reconstruct complex structural variations. Barcode-enabled phasing of co-barcoded reads increases the signal-to-noise ratio, and barcode-sharing profiles are used to filter out false positives. We integrate the short-read SV caller smoove with stLFRsv to cover smaller variations. The integrated pipeline was evaluated on the well-characterized genome HG002/NA24385 and achieved precision and recall of 74.2% and 22.3%, respectively, for deletions across the whole genome. The pipeline also found some large variations that are absent from the benchmark set; these were verified with long reads or assembly. Our work indicates that co-barcoded read technology has the potential to improve genome completeness.
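
The gap-based breakpoint signal described above can be illustrated with a minimal sketch. This is not the stLFRsv implementation: the function name `find_large_gaps`, the input tuple layout, and the `max_gap` threshold of 50,000 bp are all assumptions chosen for illustration. The sketch groups aligned reads by barcode and chromosome, sorts them by position, and reports any distance between consecutive reads that exceeds the threshold as a candidate breakpoint interval.

```python
from collections import defaultdict


def find_large_gaps(reads, max_gap=50_000):
    """Report abnormally large gaps between reads sharing a barcode.

    reads   : iterable of (barcode, chrom, start, end) tuples (hypothetical layout)
    max_gap : illustrative threshold in bp, not a value taken from stLFRsv
    Returns a sorted list of (barcode, chrom, gap_start, gap_end).
    """
    # Group read intervals by (barcode, chromosome).
    by_key = defaultdict(list)
    for barcode, chrom, start, end in reads:
        by_key[(barcode, chrom)].append((start, end))

    gaps = []
    for (barcode, chrom), intervals in by_key.items():
        intervals.sort()
        prev_end = intervals[0][1]
        for start, end in intervals[1:]:
            # A distance larger than max_gap between consecutive
            # co-barcoded reads is a candidate breakpoint signal.
            if start - prev_end > max_gap:
                gaps.append((barcode, chrom, prev_end, start))
            # Track the furthest end seen so overlapping reads are handled.
            prev_end = max(prev_end, end)
    return sorted(gaps)


if __name__ == "__main__":
    example_reads = [
        ("bc1", "chr1", 1_000, 1_100),
        ("bc1", "chr1", 5_000, 5_100),
        ("bc1", "chr1", 200_000, 200_100),
        ("bc2", "chr1", 300, 400),
    ]
    for gap in find_large_gaps(example_reads):
        print(gap)
```

In a real pipeline a single large gap is weak evidence on its own, since two distinct fragments can receive the same barcode; this is why the abstract mentions phasing and barcode-sharing profiles as additional filters.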
