Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes

Displaying the component-wise between-group differences in high-dimensional datasets is problematic because widely used plots, such as Bland–Altman and volcano plots, do not show what they are colloquially believed to show. As a result, it is difficult for the experimentalist to grasp why the between-group difference of one component is “significant” while that of another is not. Here, we propose a type of “Effect Plot” that displays each between-group difference in relation to its underlying variability, for every component of a high-dimensional dataset. We use synthetic data to show that such a plot captures the essence of what determines “significance” for the between-group difference in each component, and we provide guidance on interpreting the plot. Supplementary online materials contain the code and data for this article and include simple R functions to produce an effect plot from suitable datasets.
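To make the idea of relating a between-group difference to its underlying variability concrete, the following is a minimal sketch in Python. It is not the R code from the supplementary materials. The choice of the median as the location estimate, the unscaled median absolute deviation as the dispersion estimate, and the names `mad` and `effect_coordinates` are all assumptions made for illustration.

```python
import random
import statistics


def mad(values):
    """Unscaled median absolute deviation (an assumed dispersion measure)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def effect_coordinates(group_a, group_b):
    """Return (dispersion, difference, effect) for one component.

    difference: median(group_a) - median(group_b)
    dispersion: the larger of the two within-group MADs
    effect:     difference / dispersion (NaN when dispersion is zero)
    """
    difference = statistics.median(group_a) - statistics.median(group_b)
    dispersion = max(mad(group_a), mad(group_b))
    effect = difference / dispersion if dispersion > 0 else float("nan")
    return dispersion, difference, effect


if __name__ == "__main__":
    # Synthetic data: two components with the same true shift between groups
    # but very different within-group variability.
    rng = random.Random(0)
    components = {
        "low_variability": (
            [rng.gauss(2.0, 0.5) for _ in range(20)],
            [rng.gauss(0.0, 0.5) for _ in range(20)],
        ),
        "high_variability": (
            [rng.gauss(2.0, 4.0) for _ in range(20)],
            [rng.gauss(0.0, 4.0) for _ in range(20)],
        ),
    }
    for name, (a, b) in components.items():
        dispersion, difference, effect = effect_coordinates(a, b)
        print(f"{name}: dispersion={dispersion:.2f} "
              f"difference={difference:.2f} effect={effect:.2f}")
```

In a plot built from these values, each component would be one point, with dispersion on one axis and difference on the other. Components whose difference is large relative to their dispersion fall far from the origin along the difference axis, which is the relationship the abstract describes as determining “significance”.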
