Guidelines for Bioinformatics and the Statistical Analysis of Omic Data

This chapter is a resource for those designing omics experiments and those analyzing the resulting data. It is organized into two parts: the first focuses on bioinformatics tools and techniques, and the second on statistical analysis. It is intended as a high-level instructional chapter for readers who want to perform their own analyses, not as a comprehensive discussion of either area. The first section discusses the bioinformatics tools and algorithms used in genomics and transcriptomics. It describes typical workflows and the tools available for carrying out an omic experiment, and it underscores the importance of both choosing appropriate tools and clearly understanding the algorithms underlying them. The second section describes general study design principles that should be taken into account before an experiment begins, along with some basic principles of statistical analysis and commonly used methods. It is neither a comprehensive discussion of statistical theory nor a description of more complex statistical models. For complex study designs, hypotheses, or statistical models, the guidance of a statistician is advised.
