LD-annot: A Bioinformatics Tool to Automatically Provide Candidate SNPs With Annotations for Genetically Linked Genes

Many studies of model and non-model species now take full advantage of powerful high-throughput genotyping technologies, such as SNP arrays and genotyping-by-sequencing (GBS), to investigate the genetic basis of trait variation. However, because these technologies cover the genome incompletely, the identified SNPs are likely to be in linkage disequilibrium (LD) with the causal polymorphisms rather than being causal themselves. Researchers would therefore benefit from annotations not only for the identified candidate SNPs but also for all neighboring genes in genetic linkage with them. In such cases, the extent of LD surrounding the candidate SNPs must be estimated to determine the regions encompassing genes of interest. We describe here an automated pipeline, “LD-annot,” designed to delineate regions of interest for a given experiment and set of candidate polymorphisms on the basis of LD extent and to provide annotations for all genes within those regions. LD-annot uses standard file formats, bioinformatics tools, and languages to provide identifiers, coordinates, and annotations for genes in genetic linkage with each candidate polymorphism. Although the focus is on SNP array and GBS data, as these are routinely deployed, the pipeline can be applied to a variety of datasets as long as genotypic data are available for a high number of polymorphisms and are formatted as a VCF file. A checkpoint procedure in the pipeline allows users to test several linkage threshold values without rerunning the entire pipeline, thus saving computational time and resources. We applied this new pipeline to four sample sets that differ in sample size and number of polymorphisms: two breeding-population GBS datasets, one within-pedigree SNP set derived from whole genome sequencing (WGS), and a very large multi-variety SNP dataset obtained from WGS. LD-annot ran within minutes, even when very high numbers of polymorphisms were investigated, and will thus efficiently assist research efforts aimed at identifying biologically meaningful genetic polymorphisms underlying phenotypic variation. The LD-annot tool is available under a GPL license from https://github.com/ArnaudDroitLab/LD-annot.
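The general idea of delineating a region by LD extent and then listing the genes inside it can be illustrated with a minimal Python sketch. This is not LD-annot's implementation: the function names, the toy genotype dosages, the gene coordinates, and the use of the squared Pearson correlation of dosages as the r2 estimate are all assumptions made for illustration. The r2 values are computed once and reused for each threshold, loosely mirroring the checkpoint idea described above.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic site: r2 undefined, treated as 0 here
        return 0.0
    return sxy * sxy / (sxx * syy)

def r2_to_candidate(genotypes, candidate):
    """Compute r2 between the candidate SNP and every SNP once, for later reuse."""
    return [r_squared(genotypes[candidate], g) for g in genotypes]

def region_for_threshold(positions, r2_values, threshold):
    """Span (start, end) of all SNPs whose r2 with the candidate meets the threshold."""
    linked = [pos for pos, r2 in zip(positions, r2_values) if r2 >= threshold]
    return min(linked), max(linked)

def genes_in_region(genes, start, end):
    """Return IDs of genes (gene_id, gene_start, gene_end) overlapping [start, end]."""
    return [gid for gid, gs, ge in genes if gs <= end and ge >= start]

# Hypothetical toy data: four SNPs on one chromosome, six samples each.
positions = [100, 200, 300, 400]
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # candidate SNP (index 1)
    [0, 1, 2, 0, 1, 1],
    [2, 0, 1, 1, 2, 0],
]
genes = [("geneA", 50, 150), ("geneB", 250, 350), ("geneC", 500, 600)]

r2_values = r2_to_candidate(genotypes, candidate=1)  # computed once
for threshold in (0.9, 0.7):                         # reused per threshold
    start, end = region_for_threshold(positions, r2_values, threshold)
    print(threshold, (start, end), genes_in_region(genes, start, end))
```

With these toy values, lowering the threshold from 0.9 to 0.7 widens the region from positions 100-200 to 100-300 and adds geneB to the list of linked genes.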
