Statistical Issues in the Analysis of ChIP-Seq and RNA-Seq Data

The recent arrival of ultra-high throughput, next generation sequencing (NGS) technologies has revolutionized the genetics and genomics fields by allowing rapid and inexpensive sequencing of billions of bases. The rapid deployment of NGS in a variety of sequencing-based experiments has resulted in fast accumulation of massive amounts of sequencing data. To process this new type of data, a torrent of increasingly sophisticated algorithms and software tools are emerging to help the analysis stage of the NGS applications. In this article, we strive to comprehensively identify the critical challenges that arise from all stages of NGS data analysis and provide an objective overview of what has been achieved in existing works. At the same time, we highlight selected areas that need much further research to improve our current capabilities to delineate the most information possible from NGS data. The article focuses on applications dealing with ChIP-Seq and RNA-Seq.

[1]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[2]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[3]  Bin Ma,et al.  ZOOM! Zillions of oligos mapped , 2008, Bioinform..

[4]  D. Botstein,et al.  Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF , 2001, Nature.

[5]  E. Lander,et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis. , 1988, Genomics.

[6]  M. Stephens,et al.  RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. , 2008, Genome research.

[7]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[8]  R. Guigó,et al.  Transcriptome genetics using second generation sequencing in a Caucasian population , 2010, Nature.

[9]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[10]  A. Oshlack,et al.  Transcript length bias in RNA-seq data confounds systems biology , 2009, Biology Direct.

[11]  H. Bussemaker,et al.  Regulatory element detection using correlation with expression , 2001, Nature Genetics.

[12]  Raymond K. Auerbach,et al.  PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls , 2009, Nature Biotechnology.

[13]  David J Studholme,et al.  A draft genome sequence and functional screen reveals the repertoire of type III secreted proteins of Pseudomonas syringae pathovar tabaci 11528 , 2009, BMC Genomics.

[14]  Clifford A. Meyer,et al.  Model-based Analysis of ChIP-Seq (MACS) , 2008, Genome Biology.

[15]  Michael Brudno,et al.  SHRiMP: Accurate Mapping of Short Color-space Reads , 2009, PLoS Comput. Biol..

[16]  James T Kadonaga,et al.  Regulation of RNA Polymerase II Transcription by Sequence-Specific DNA Binding Factors , 2004, Cell.

[17]  H. Bussemaker,et al.  Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[18]  M. Pellegrini,et al.  Conservation and divergence of methylation patterning in plants and animals , 2010, Proceedings of the National Academy of Sciences.

[19]  M. Gerstein,et al.  The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing , 2008, Science.

[20]  M. A. McClure,et al.  Hidden Markov models of biological primary sequence information. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Heidi Dvinge,et al.  PeakAnalyzer: Genome-wide annotation of chromatin binding and modification loci , 2010, BMC Bioinformatics.

[22]  Wing Hung Wong,et al.  SeqMap: mapping massive amount of oligonucleotides to the genome , 2008, Bioinform..

[23]  G. Stormo,et al.  Identifying protein-binding sites from unaligned DNA fragments. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[24]  E. Lander,et al.  Construction of multilocus genetic linkage maps in humans. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[25]  Nancy F. Hansen,et al.  Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry , 2008, Nature.

[26]  Eric S. Lander,et al.  Sequencing the nuclear genome of the extinct woolly mammoth , 2008, Nature.

[27]  Giorgio Valle,et al.  PASS: a program to align short sequences , 2009, Bioinform..

[28]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature biotechnology.

[29]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[30]  Thomas D. Wu,et al.  GMAP: a genomic mapping and alignment program for mRNA and EST sequence , 2005, Bioinform..

[31]  Zhaohui S. Qin,et al.  On the detection and refinement of transcription factor binding sites using ChIP-Seq data , 2010, Nucleic acids research.

[32]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[33]  N. D. Clarke,et al.  Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells , 2008, Cell.

[34]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[35]  Nicola J. Rinaldi,et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae , 2002, Science.

[36]  John J. Wyrick,et al.  Genome-wide location and function of DNA binding proteins. , 2000, Science.

[37]  Heejung Shim,et al.  Integrating quantitative information from ChIP-chip experiments into motif finding. , 2008, Biostatistics.

[38]  Juliane C. Dohm,et al.  Substantial biases in ultra-short read data sets from high-throughput DNA sequencing , 2008, Nucleic acids research.

[39]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[40]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[41]  M. Daly,et al.  PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes , 2003, Nature Genetics.

[42]  James R. Knight,et al.  Genome sequencing in microfabricated high-density picolitre reactors , 2005, Nature.

[43]  S. Nelson,et al.  BFAST: An Alignment Tool for Large Scale Genome Resequencing , 2009, PloS one.

[44]  Paul Flicek,et al.  Sense from sequence reads: methods for alignment and assembly , 2009, Nature Methods.

[45]  Dustin E. Schones,et al.  High-Resolution Profiling of Histone Methylations in the Human Genome , 2007, Cell.

[46]  David A. Nix,et al.  Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks , 2008, BMC Bioinformatics.

[47]  Richard K. Wilson,et al.  Aspects of coverage in medical DNA sequencing , 2008, BMC Bioinformatics.

[48]  T. Mikkelsen,et al.  Genome-scale DNA methylation maps of pluripotent and differentiated cells , 2008, Nature.

[49]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[50]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[51]  E. Mardis The impact of next-generation sequencing technology on genetics. , 2008, Trends in genetics : TIG.

[52]  Heng Li,et al.  A survey of sequence alignment algorithms for next-generation sequencing , 2010, Briefings Bioinform..

[53]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[54]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[55]  Allen D. Delaney,et al.  Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing , 2007, Nature Methods.

[56]  Matthew D. Young,et al.  Gene ontology analysis for RNA-seq: accounting for selection bias , 2010, Genome Biology.

[57]  Debashis Ghosh Detecting outlier genes from high-dimensional data: a fuzzy approach , 2010, BIOINFORMATICS 2010.

[58]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[59]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[60]  Joseph K. Pickrell,et al.  Understanding mechanisms underlying human gene expression variation with RNA sequencing , 2010, Nature.

[61]  Hyungwon Choi,et al.  Hierarchical hidden Markov model with application to joint analysis of ChIP-chip and ChIP-seq data , 2009, Bioinform..

[62]  Steven J. M. Jones,et al.  FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology , 2008, Bioinform..

[63]  Weihong Qi,et al.  Genomic Diversity and Evolution of Mycobacterium ulcerans Revealed by Next-Generation Sequencing , 2009, PLoS pathogens.

[64]  Simon Tavaré,et al.  BayesPeak: Bayesian analysis of ChIP-seq data , 2009, BMC Bioinformatics.

[65]  L. Baum,et al.  Statistical Inference for Probabilistic Functions of Finite State Markov Chains , 1966 .

[66]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[67]  Amy E. Hawkins,et al.  DNA sequencing of a cytogenetically normal acute myeloid leukemia genome , 2008, Nature.

[68]  Peter B. Gilbert,et al.  A modified false discovery rate multiple‐comparisons procedure for discrete data, applied to human immunodeficiency virus genetics , 2005 .

[69]  Cheng Cheng,et al.  Robust estimation of the false discovery rate , 2006, Bioinform..

[70]  Korbinian Strimmer,et al.  BMC Bioinformatics BioMed Central Methodology article A general modular framework for gene set enrichment analysis , 2009 .

[71]  Zhaohui S. Qin,et al.  An integrated network of androgen receptor, polycomb, and TMPRSS2-ERG gene fusions in prostate cancer progression. , 2010, Cancer cell.

[72]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[73]  S. Batzoglou,et al.  Genome-Wide Analysis of Transcription Factor Binding Sites Based on ChIP-Seq Data , 2008, Nature Methods.

[74]  D. Haussler,et al.  A hidden Markov model that finds genes in E. coli DNA. , 1994, Nucleic acids research.

[75]  Bertram Ludäscher,et al.  Sole-Search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data , 2009, Nucleic acids research.

[76]  David Botstein,et al.  Promoter-specific binding of Rap1 revealed by genome-wide maps of protein–DNA association , 2001, Nature Genetics.

[77]  S. Turner,et al.  Real-time DNA sequencing from single polymerase molecules. , 2010, Methods in enzymology.

[78]  John Quackenbush,et al.  What would you do if you could sequence everything? , 2008, Nature Biotechnology.

[79]  Zhaohui S. Qin,et al.  Operon Prediction for Sequenced Bacterial Genomes without Experimental Information , 2006, Applied and Environmental Microbiology.

[80]  Jun Song,et al.  CEAS: cis-regulatory element annotation system , 2006, Nucleic Acids Res..

[81]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature Biotechnology.

[82]  P. Park,et al.  Design and analysis of ChIP-seq experiments for DNA-binding proteins , 2008, Nature Biotechnology.

[83]  Clifford A. Meyer,et al.  A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences , 2005, ISMB.

[84]  P. Bork,et al.  A human gut microbial gene catalogue established by metagenomic sequencing , 2010, Nature.

[85]  T. Mikkelsen,et al.  Genome-wide maps of chromatin state in pluripotent and lineage-committed cells , 2007, Nature.

[86]  G. Churchill Stochastic models for heterogeneous DNA sequences. , 1989, Bulletin of mathematical biology.

[87]  Raja Jothi,et al.  Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data , 2008, Nucleic acids research.

[88]  Zachary D. Smith,et al.  Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution , 2010, Nature Methods.

[89]  A. Sandelin,et al.  Applied bioinformatics for the identification of regulatory elements , 2004, Nature Reviews Genetics.

[90]  Alan David Hutson,et al.  Resampling Methods for Dependent Data , 2004, Technometrics.

[91]  Ruiqiang Li,et al.  SOAP: short oligonucleotide alignment program , 2008, Bioinform..

[92]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[93]  Douglas L. Brutlag,et al.  Bayesian Segmentation of Protein Secondary Structure , 2000, J. Comput. Biol..

[94]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[95]  Thomas D. Wu,et al.  A highly annotated whole-genome sequence of a Korean individual , 2009, Nature.

[96]  T. Laajala,et al.  A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments , 2009, BMC Genomics.

[97]  Zhaohui S. Qin,et al.  HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data , 2010, BMC Bioinformatics.

[98]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[99]  M. Facciotti,et al.  Evaluation of Algorithm Performance in ChIP-Seq Peak Detection , 2010, PloS one.

[100]  W. Miller,et al.  Complete mitochondrial genome of a Pleistocene jawbone unveils the origin of polar bear , 2010, Proceedings of the National Academy of Sciences.

[101]  Raphael Gottardo,et al.  PICS: Probabilistic Inference for ChIP‐seq , 2009, Biometrics.

[102]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[103]  S. Quake,et al.  Single-Molecule DNA Sequencing of a Viral Genome , 2008, Science.

[104]  A. Mortazavi,et al.  Genome-Wide Mapping of in Vivo Protein-DNA Interactions , 2007, Science.

[105]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[106]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[107]  Timothy E. Reddy,et al.  Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. , 2009, Genome research.

[108]  Jun S. Liu,et al.  Integrating regulatory motif discovery and genome-wide expression analysis , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[109]  J. Shendure,et al.  Materials and Methods Som Text Figs. S1 and S2 Tables S1 to S4 References Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome , 2022 .

[110]  Terrence S. Furey,et al.  F-Seq: a feature density estimator for high-throughput sequence tags , 2008, Bioinform..