Paper Information - Statistical Issues in the Analysis of ChIP-Seq and RNA-Seq Data

The recent arrival of ultra-high-throughput next-generation sequencing (NGS) technologies has revolutionized genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. The rapid deployment of NGS in a variety of sequencing-based experiments has led to a fast accumulation of massive amounts of sequencing data. To process this new type of data, a torrent of increasingly sophisticated algorithms and software tools is emerging to support the analysis stage of NGS applications. In this article, we strive to comprehensively identify the critical challenges that arise at all stages of NGS data analysis and to provide an objective overview of what existing work has achieved. At the same time, we highlight selected areas that need much further research to improve our ability to extract as much information as possible from NGS data. The article focuses on ChIP-Seq and RNA-Seq applications.

Debashis Ghosh | Zhaohui S. Qin
