Ambiguous fragment assignment for high-throughput sequencing experiments

Author(s): Roberts, Adam | Advisor(s): Pachter, Lior | Abstract: As the cost of short-read, high-throughput DNA sequencing continues to fall rapidly, new uses for the technology have been developed aside from its original purpose in determining the genome of various species. Many of these new experiments use the sequencer as a digital counter for measuring biological activities such as gene expression (RNA-Seq) or protein binding (ChIP-Seq).A common problem faced in the analysis of these data is that of sequenced fragments that are "ambiguous", meaning they resemble multiple loci in a reference genome or other sequence. In early analyses, such ambiguous fragments were ignored or were assigned to loci using simple heuristics. However, statistical approaches using maximum likelihood estimation have been shown to greatly improve the accuracy of downstream analyses and have become widely adopted Optimization based on the expectation-maximization (EM) algorithm are often employed by these methods to find the optimal sets of alignments, with frequent enhancements to the model. Nevertheless, these improvements increase complexity, which, along with an exponential growth in the size of sequencing datasets, has led to new computational challenges.Herein, we present our model for ambiguous fragment assignment for RNA-Seq, which includes the most comprehensive set of parameters of any model introduced to date, as well as various methods we have explored for scaling our optimization procedure. These methods include the use of an online EM algorithm and a distributed EM solution implemented on the Spark cluster computing system. Our advances have resulted in the first efficient solution to the problem of fragment assignment in sequencing.Furthermore, we are the first to create a fully generalized model for ambiguous fragment assignment and present details on how our method can provide solutions for additional high-throughput sequencing assays including ChIP-Seq, Allele-Specific Expression (ASE), and the detection of RNA-DNA Differences (RDDs) in RNA-Seq.

[1]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[2]  Tao Jiang,et al.  Workshop: Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads , 2012, 2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS).

[3]  Sandrine Dudoit,et al.  Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments , 2010, BMC Bioinformatics.

[4]  B. Langmead,et al.  Cloud-scale RNA-sequencing differential expression analysis with Myrna , 2010, Genome Biology.

[5]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[6]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[7]  D. Branton,et al.  The potential and challenges of nanopore sequencing , 2008, Nature Biotechnology.

[8]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[9]  Eran Halperin,et al.  Accurate Estimation of Expression Levels of Homologous Genes in RNA-seq Experiments , 2011, J. Comput. Biol..

[10]  S. Srivastava,et al.  A two-parameter generalized Poisson model to improve the analysis of RNA-seq data , 2010, Nucleic acids research.

[11]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[12]  Joseph K. Pickrell,et al.  Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome” , 2012, Science.

[13]  Peter Meinicke,et al.  Mixture models for analysis of the taxonomic composition of metagenomes , 2011, Bioinform..

[14]  Krishna R. Kalari,et al.  Detection of redundant fusion transcripts as biomarkers or disease-specific therapeutic targets in breast cancer. , 2012, Cancer research.

[15]  Allen D. Delaney,et al.  Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing , 2007, Nature Methods.

[16]  Carsten O. Daub,et al.  Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite , 2009, Bioinform..

[17]  J. Bähler,et al.  Cellular and Molecular Life Sciences REVIEW RNA-seq: from technology to biology , 2022 .

[18]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[19]  Colin N. Dewey,et al.  Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data , 2011, PLoS Comput. Biol..

[20]  Crispin J. Miller,et al.  A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling , 2010, BMC Genomics.

[21]  Yi Xing,et al.  An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs , 2006, Nucleic acids research.

[22]  Orion J. Buske,et al.  iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data , 2013, Genome research.

[23]  L. Stein The case for cloud computing in genome informatics , 2010, Genome Biology.

[24]  Dan Klein,et al.  Online EM for Unsupervised Models , 2009, NAACL.

[25]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[26]  L. Pachter Models for transcript quantification from RNA-Seq , 2011, 1104.3889.

[27]  Alexandre M. Bayen,et al.  Scaling the mobile millennium system in the cloud , 2011, SoCC.

[28]  Terence P. Speed,et al.  Methods for Allocating Ambiguous Short-reads , 2010, Commun. Inf. Syst..

[29]  Development and evaluation of RNA-seq methods , 2010, Genome Biology.

[30]  Eran Halperin,et al.  Accurate Estimation of Expression Levels of Homologous Genes in RNA-seq Experiments , 2010, RECOMB.

[31]  Yu-ping Wang,et al.  Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies , 2013, PloS one.

[32]  Wenwei Zhang,et al.  Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome , 2012, Nature Biotechnology.

[33]  GhemawatSanjay,et al.  The Google file system , 2003 .

[34]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[35]  Y. Xing,et al.  Detection of splice junctions from paired-end RNA-seq data by SpliceMap , 2010, Nucleic acids research.

[36]  David R. Kelley,et al.  Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks , 2012, Nature Protocols.

[37]  L. Pachter A closer look at RNA editing , 2012, Nature Biotechnology.

[38]  Colin N. Dewey,et al.  RNA-Seq gene expression estimation with read mapping uncertainty , 2009, Bioinform..

[39]  A. Brazma,et al.  Reuse of public genome-wide gene expression data , 2012, Nature Reviews Genetics.

[40]  O. Cappé,et al.  On‐line expectation–maximization algorithm for latent data models , 2009 .

[41]  Mingyao Li,et al.  Widespread RNA and DNA Sequence Differences in the Human Transcriptome , 2011, Science.

[42]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[43]  Nicholas T. Ingolia,et al.  Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling , 2009, Science.

[44]  Inanç Birol,et al.  Dissect: detection and characterization of novel structural alterations in transcribed sequences , 2012, Bioinform..

[45]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[46]  L. Pachter,et al.  Streaming fragment assignment for real-time analysis of sequencing experiments , 2012, Nature Methods.

[47]  W. Wong,et al.  Modeling non-uniformity in short-read rates in RNA-Seq data , 2010, Genome Biology.

[48]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[49]  Eric T. Wang,et al.  Alternative Isoform Regulation in Human Tissue Transcriptomes , 2008, Nature.

[50]  Martin Kircher,et al.  High‐throughput DNA sequencing – concepts and limitations , 2010, BioEssays : news and reviews in molecular, cellular and developmental biology.

[51]  Wing Hung Wong,et al.  Statistical inferences for isoform expression in RNA-Seq , 2009, Bioinform..

[52]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[53]  B. Wold,et al.  Sequence census methods for functional genomics , 2008, Nature Methods.

[54]  Lior Pachter,et al.  Updating RNA-Seq analyses after re-annotation , 2013, Bioinform..

[55]  Gunnar Rätsch,et al.  Oqtans: a Galaxy-integrated workflow for quantitative transcriptome analysis from NGS Data , 2011, BMC Bioinformatics.

[56]  Maqc Consortium The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements , 2006, Nature Biotechnology.

[57]  Jin Billy Li,et al.  Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome” , 2012, Science.

[58]  Tao Jiang,et al.  IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly - (Extended Abstract) , 2011, RECOMB.

[59]  K. Hansen,et al.  Biases in Illumina transcriptome sequencing caused by random hexamer priming , 2010, Nucleic acids research.

[60]  Bruce Hendrickson Graph Partitioning , 2011, Encyclopedia of Parallel Computing.

[61]  E. Wang,et al.  Analysis and design of RNA sequencing experiments for identifying isoform regulation , 2010, Nature Methods.

[62]  Ronghua Chen,et al.  Digital transcriptome profiling using selective hexamer priming for cDNA synthesis , 2009, Nature Methods.

[63]  Lior Pachter,et al.  RESEARCH ARTICLE Open Access Identification and correction of systematic error in high-throughput sequence data , 2022 .

[64]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[65]  Raymond K. Auerbach,et al.  The real cost of sequencing: higher than you think! , 2011, Genome Biology.

[66]  W. Huber,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[67]  Peter D Karp,et al.  The past, present and future of genome-wide re-annotation , 2002, Genome Biology.

[68]  Cole Trapnell,et al.  Improving RNA-Seq expression estimates by correcting for fragment bias , 2011, Genome Biology.

[69]  Michael W Pfaffl,et al.  RNA integrity and the effect on the real-time qRT-PCR performance. , 2006, Molecular aspects of medicine.

[70]  Ion I. Mandoiu,et al.  Estimation of alternative splicing isoform frequencies from RNA-Seq data , 2010, Algorithms for Molecular Biology.

[71]  David G Hendrickson,et al.  Differential analysis of gene regulation at transcript resolution with RNA-seq , 2012, Nature Biotechnology.

[72]  B. Graveley The developmental transcriptome of Drosophila melanogaster , 2010, Nature.