Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads

Background: Gene isoforms are commonly found in both prokaryotes and eukaryotes. Since each isoform may perform a specific function in response to changing environmental conditions, studying the dynamics of gene isoforms is important for understanding biological processes and disease conditions. However, genome-wide identification of gene isoforms is technically challenging because of the high degree of sequence identity among isoforms. The traditional targeted sequencing approach, which involves Sanger sequencing of plasmid-cloned PCR products, has low throughput and is tedious and time-consuming. Next-generation sequencing technologies such as Illumina and 454 achieve high throughput, but their short read lengths are a critical barrier to accurate assembly of highly similar gene isoforms and may result in ambiguities and false joining during sequence assembly. More recently, third-generation sequencing, represented by the PacBio platform, offers sufficient throughput and long reads covering the full length of typical genes, and thus has the potential to profile gene isoforms reliably. However, PacBio long reads are error-prone and cannot be effectively analyzed by traditional assembly programs.

Results: We present a clustering-based analysis pipeline integrated with PacBio sequencing data for profiling highly similar gene isoforms. This approach was first evaluated against de novo assembly of 454 reads using a benchmark admixture containing 10 known, cloned msg genes encoding the major surface glycoprotein of Pneumocystis jirovecii. The new approach reconstructed all 10 msg isoforms with the expected length (~1.5 kb) and correct sequence, whereas the 454 reads could not be correctly assembled by any of the assembly programs tested. With an additional benchmark admixture containing 22 known P. jirovecii msg isoforms, the approach accurately reconstructed all but 4 of these isoforms at full length (~3 kb); these 4 isoforms were present at low concentrations in the admixture. Finally, when applied to the original clinical sample from which the 22 known msg isoforms were cloned, the approach not only identified all known isoforms accurately (~3 kb each) but also identified 48 novel isoforms.

Conclusions: PacBio sequencing integrated with the clustering-based analysis pipeline achieves high-throughput, high-resolution discrimination of highly similar sequences, and can serve as a new approach for genome-wide characterization of gene isoforms and other highly repetitive sequences.
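
The abstract does not describe the pipeline's internals, so the following is only a toy illustration of the general idea of separating error-prone reads by clustering and then correcting errors within each cluster. It is not the authors' method. It assumes equal-length reads with substitution errors only (real PacBio errors are dominated by insertions and deletions and would require alignment), uses greedy identity-threshold clustering followed by a per-column majority vote, and every name and value in it (`identity`, `greedy_cluster`, `consensus`, `mutate`, the 0.85 threshold, the 20-base sequences) is hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold):
    """Put each read in the first cluster whose seed read it matches at
    >= threshold identity; otherwise start a new cluster seeded by it."""
    clusters = []  # list of (seed_read, member_reads)
    for read in reads:
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return [members for _, members in clusters]


def consensus(members):
    """Per-column majority vote across the reads of one cluster."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))


def mutate(seq, pos, base):
    """Return seq with a single substitution at pos (simulated read error)."""
    return seq[:pos] + base + seq[pos + 1:]


# Two hypothetical isoforms that differ at 4 of 20 positions (80% identity).
iso_a = "ACGTACGTACGTACGTACGT"
iso_b = "ACATACGCACGTTCGTAGGT"

# Three simulated reads per isoform, each carrying one substitution error.
reads = [
    mutate(iso_a, 0, "T"), mutate(iso_b, 3, "A"),
    mutate(iso_a, 5, "G"), mutate(iso_b, 9, "T"),
    mutate(iso_a, 10, "C"), mutate(iso_b, 15, "G"),
]

clusters = greedy_cluster(reads, threshold=0.85)
recovered = [consensus(members) for members in clusters]
print(recovered)
```

On this toy input the reads fall into two clusters of three, and the majority vote removes the single error in each read, so the two consensus sequences equal `iso_a` and `iso_b`. The threshold has to sit between the read-to-read identity within an isoform and the identity between isoforms; as the isoforms become more similar or the reads noisier, that window narrows, which is the difficulty the abstract highlights.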
