The shape of gene expression distributions matter: how incorporating distribution shape improves the interpretation of cancer transcriptomic data

In genomics, we often assume that continuous data, such as gene expression, follow a specific kind of distribution. However, we rarely stop to question the validity of this assumption or to consider whether it applies to every gene in the transcriptome. Our study investigated the prevalence of a range of gene expression distributions in three different tumor types from The Cancer Genome Atlas (TCGA). Surprisingly, fewer than 50% of all genes had Normally distributed expression; other distributions, including Gamma, Bimodal, Cauchy, and Lognormal, were also represented. Most of the distribution categories contained genes that were significantly enriched for unique biological processes. We then used different assumptions about the shape of the expression profile to identify genes that could discriminate between patients with good versus poor survival. The prognostic marker genes identified when distribution shape was accounted for reflected functional insights into cancer biology that were not observed when standard assumptions were applied. When multiple types of distribution were permitted, i.e. when the shape of the expression profile was used, the statistical classifiers predicted patient prognosis more accurately than classifiers that assumed only one type of gene expression distribution. Our results highlight the value of studying a gene’s distribution shape to model the heterogeneity of transcriptomic data, and the impact of using analyses that permit more than one type of gene expression distribution. These insights would have been overlooked by standard approaches that assume all genes follow the same type of distribution in a patient cohort.
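
The abstract does not say how each gene was assigned to a distribution category. As a minimal sketch of the general idea, assuming maximum-likelihood fits compared by AIC (an illustrative choice, not the authors' stated procedure), the example below picks between just two candidates, Normal and Lognormal, for one gene's expression values. The function names and the simulated data are hypothetical.

```python
import math
import random
import statistics


def normal_loglik(values):
    """Maximised Normal log-likelihood, using the MLE mean and variance."""
    n = len(values)
    mu = statistics.fmean(values)
    var = sum((v - mu) ** 2 for v in values) / n  # MLE variance divides by n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def lognormal_loglik(values):
    """Maximised Lognormal log-likelihood; -inf if any value is not positive."""
    if min(values) <= 0:
        return float("-inf")  # Lognormal has no support at or below zero
    logs = [math.log(v) for v in values]
    # Change of variables: density of x is the Normal density of log(x) over x.
    return normal_loglik(logs) - sum(logs)


def best_shape(values):
    """Return (label with the lowest AIC, dict of AIC per candidate)."""
    k = 2  # both candidates have two free parameters
    aic = {
        "Normal": 2 * k - 2 * normal_loglik(values),
        "Lognormal": 2 * k - 2 * lognormal_loglik(values),
    }
    return min(aic, key=aic.get), aic


# Simulated expression values for two hypothetical genes (assumed data).
rng = random.Random(0)
symmetric_gene = [rng.gauss(5.0, 1.0) for _ in range(2000)]
skewed_gene = [rng.lognormvariate(1.0, 0.8) for _ in range(2000)]

print(best_shape(symmetric_gene)[0])
print(best_shape(skewed_gene)[0])
```

Extending the sketch to the other categories named above (Gamma, Bimodal, Cauchy) would require numerical fitting or a mixture model, which is omitted here.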
