Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon

A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method using long PacBio Iso-Seq reads to tackle this challenge. We apply IsoCon to nine Y chromosome ampliconic gene families and show that it outperforms existing methods on both experimental and simulated data. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.

Transcripts from highly similar multigene families are challenging to decipher. Here, the authors develop IsoCon, a tool for detecting and reconstructing isoforms from multigene families by analyzing long PacBio Iso-Seq reads.
