aWGRS: Automates paired-end whole genome re-sequencing data analysis framework

To spare users from cumbersome command-line operations and repeated parameter tuning, this paper presents aWGRS, an automated framework for analyzing paired-end whole genome re-sequencing data. With its dependencies pre-installed, aWGRS takes paired-end reads and a reference genome as input, maps the reads to the reference, realigns variations, and returns re-sequencing information. The tool is built around the steps that re-sequencing requires: alignment to the reference, single nucleotide polymorphism (SNP) calling, insertion/deletion (InDel) calling, structural variant (SV) calling, and annotation. By introducing and adjusting a new concept called the recall rate, the requirements on coverage rate and accuracy rate can be met at the same time. Within the range of the recall rate, a variation is evaluated by two criteria, its quality value and the number of reads that support it, and the one with the higher quality value and the larger number of supporting reads is finally selected. Genome-wide genetic variations between precocious trifoliate orange and its wild type were identified in [1]; empirical results show that, compared with the results in [1], which were provided by the Beijing Genomics Institute (BGI), aWGRS substantially reduces the number of reported variations and greatly improves accuracy. Overall, the adjustable parameters in aWGRS affect the experimental results, and the default filtering strategy based on the mutation recall rate also attains good results automatically.
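The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the aWGRS implementation: it assumes that "the range of the recall rate" can be read as a positional window in base pairs, and that ties on quality are broken by the number of supporting reads. The names `Variant`, `filter_variants`, and `window` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    chrom: str      # chromosome or contig name
    pos: int        # 1-based position on the reference
    quality: float  # quality value reported by the variant caller
    support: int    # number of reads supporting the variant


def _best(cluster: List[Variant]) -> Variant:
    # Assumption: rank by quality first, then by supporting-read count.
    return max(cluster, key=lambda v: (v.quality, v.support))


def filter_variants(variants: List[Variant], window: int) -> List[Variant]:
    """Keep one candidate per cluster of nearby variants.

    Candidates on the same chromosome whose position lies within
    `window` bases of the previous candidate are grouped together
    (assumed stand-in for the paper's recall-rate range).
    """
    kept: List[Variant] = []
    cluster: List[Variant] = []
    for v in sorted(variants, key=lambda x: (x.chrom, x.pos)):
        starts_new_cluster = bool(cluster) and (
            v.chrom != cluster[-1].chrom or v.pos - cluster[-1].pos > window
        )
        if starts_new_cluster:
            kept.append(_best(cluster))
            cluster = []
        cluster.append(v)
    if cluster:
        kept.append(_best(cluster))
    return kept


if __name__ == "__main__":
    candidates = [
        Variant("chr1", 100, 30.0, 5),
        Variant("chr1", 103, 50.0, 8),
        Variant("chr1", 500, 20.0, 3),
    ]
    for v in filter_variants(candidates, window=10):
        print(v)
```

Under these assumptions, widening `window` merges more candidates into each cluster and reports fewer variations, which is one way an adjustable parameter of this kind could trade coverage against accuracy.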

[1]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[2]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[3]  K. Lindblad-Toh,et al.  Whole-genome resequencing reveals loci under selection during chicken domestication , 2010, Nature.

[4]  E. Mardis The impact of next-generation sequencing technology on genetics. , 2008, Trends in genetics : TIG.

[5]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P , 2022 .

[6]  Ken Chen,et al.  VarScan: variant detection in massively parallel sequencing of individual and pooled samples , 2009, Bioinform..

[7]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[8]  P. Argos,et al.  Analysis of insertions/deletions in protein structures. , 1992, Journal of molecular biology.

[9]  Christopher A. Miller,et al.  VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. , 2012, Genome research.

[10]  R. Wilson,et al.  BreakDancer: An algorithm for high resolution mapping of genomic structural variation , 2009, Nature Methods.

[11]  A. Brookes The essence of SNPs. , 1999, Gene.

[12]  J. Lupski Structural variation in the human genome. , 2007, The New England journal of medicine.

[13]  W. Kuo,et al.  High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays , 1998, Nature Genetics.

[14]  Jialing Yao,et al.  Identification of flowering-related genes between early flowering trifoliate orange mutant and wild-type trifoliate orange (Poncirus trifoliata L. Raf.) by suppression subtraction hybridization (SSH) and macroarray. , 2009, Gene.

[15]  Ruiqiang Li,et al.  SOAP: short oligonucleotide alignment program , 2008, Bioinform..

[16]  B. Langmead,et al.  Lighter: fast and memory-efficient sequencing error correction without counting , 2014, Genome Biology.

[17]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[18]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[19]  Heng Li,et al.  A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data , 2011, Bioinform..

[20]  Chun-Gen Hu,et al.  Identifying the genome-wide genetic variation between precocious trifoliate orange and its wild type and developing new markers for genetics research , 2016, DNA research : an international journal for rapid publication of reports on genes and genomes.

[21]  Xiaoyan Ai,et al.  Transcriptome profile analysis of flowering molecular processes of early flowering trifoliate orange mutant and the wild-type [Poncirus trifoliata (L.) Raf.] by massively parallel signature sequencing , 2011, BMC Genomics.

[22]  Christopher A. Miller,et al.  ReadDepth: A Parallel R Package for Detecting Copy Number Alterations from Short Sequencing Reads , 2011, PloS one.

[23]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[24]  D. Lipman,et al.  Rapid and sensitive protein similarity searches. , 1985, Science.

[25]  Kai Ye,et al.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads , 2009, Bioinform..

[26]  Richard Durbin,et al.  Fast and accurate short read alignment with Burrows–Wheeler transform , 2009 .

[27]  Huanming Yang,et al.  SNP detection for massively parallel whole-genome resequencing. , 2009, Genome research.

[28]  C. Thermes,et al.  Ten years of next-generation sequencing technology. , 2014, Trends in genetics : TIG.