Estimating Copy Number and Allelic Variation at the Immunoglobulin Heavy Chain Locus Using Short Reads

The study of genomic regions that contain gene copies and structural variation is a major challenge in modern genomics. Unlike single-nucleotide variation, copy number variation is difficult to measure, and few tools exist for analyzing how it differs between individuals. The immunoglobulin heavy variable (IGHV) locus, which plays an integral role in the adaptive immune response, is an example of a complex genomic region that varies in gene copy number. The lack of standard methods for genotyping this region prevents it from being included in association studies and holds back the growing field of antibody repertoire analysis. Here we develop a method that takes short reads from high-throughput sequencing and outputs a genetic profile of the IGHV locus, giving the read coverage depth and a putative nucleotide sequence for each operationally defined gene cluster. These operationally defined gene clusters address a major challenge in studying the IGHV locus: the high sequence similarity between gene segments in different genomic locations. Tests on simulated data demonstrate that our approach can accurately determine the presence or absence of a gene cluster from reads as short as 70 bp. Finer resolution of gene cluster copy number can be obtained from read coverage depth using longer reads (e.g., ≥ 100 bp). Nucleotide-level sequences of single-copy genes (genes present in one copy per haplotype) can be determined with 250 bp reads. For IGHV genes with more than one copy, accurate nucleotide-resolution reconstruction is currently beyond the means of our approach. When applied to a family of European ancestry, our pipeline outputs genotypes that are consistent with the family pedigree, confirms existing multigene variants, and suggests new copy number variants. This study paves the way for analyzing population-level patterns of variation in IGHV gene clusters in larger, more diverse datasets and for quantitatively handling copy number variation in other structurally varying and complex loci.
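
The idea of reading copy number off read coverage depth can be illustrated with a minimal sketch. It is not the pipeline described above: the function name, the cluster labels, the use of a median baseline, and the choice of which clusters count as single-copy are all assumptions made for illustration. The sketch assumes a diploid sample and treats the median depth of the presumed single-copy clusters (one copy per haplotype, so two copies in total) as the baseline.

```python
from statistics import median


def estimate_copy_number(cluster_depth, single_copy_clusters, ploidy=2):
    """Estimate total (diploid) copy number per gene cluster from mean read depth.

    cluster_depth: dict mapping a cluster name to its mean read coverage depth.
    single_copy_clusters: names of clusters assumed (hypothetically) to be
        present in exactly one copy per haplotype; they set the baseline.
    ploidy: number of haplotypes in the sample (2 for a diploid individual).
    """
    # Baseline depth corresponds to `ploidy` copies of a single-copy cluster.
    baseline = median(cluster_depth[name] for name in single_copy_clusters)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")

    # Depth contributed by one copy of a cluster.
    depth_per_copy = baseline / ploidy

    # Round the depth ratio to the nearest whole number of copies.
    return {
        name: round(depth / depth_per_copy)
        for name, depth in cluster_depth.items()
    }


# Hypothetical mean depths for five made-up gene clusters.
depths = {"A": 30.0, "B": 31.0, "C": 0.0, "D": 46.0, "E": 61.0}
copies = estimate_copy_number(depths, single_copy_clusters=["A", "B"])
print(copies)
```

Here a cluster with zero depth is called absent, and clusters with depth well above the baseline are called as carrying extra copies. Rounding to whole numbers is a simplification; noisy coverage or short reads that map ambiguously between similar gene segments would make such calls less reliable.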
