An approach for normalization and quality control for NanoString RNA expression data.

The NanoString RNA counting assay for formalin-fixed paraffin-embedded samples is unique in its sensitivity, technical reproducibility and robustness for analysis of clinical and archival samples. While commercial normalization methods are provided by NanoString, they are not optimal for all settings, particularly when samples exhibit strong technical or biological variation or when housekeeping genes perform variably across the cohort. Here, we develop and evaluate a more comprehensive normalization procedure for NanoString data with steps for quality control, selection of housekeeping targets, normalization, and iterative data visualization and biological validation. The approach was evaluated using a large cohort ($N=1649$) from the Carolina Breast Cancer Study, two cohorts of moderate sample size ($N=359$ and $N=130$) and a small published dataset ($N=12$). The iterative process developed here eliminates technical variation (e.g. from different study phases or sites) more reliably than three other methods, including NanoString's commercial package, without diminishing biological variation, especially in long-term longitudinal multiphase or multisite cohorts. We also find that probe sets validated for nCounter, such as the PAM50 gene signature, are impervious to batch issues. This work emphasizes that systematic quality control, normalization and visualization of NanoString nCounter data are imperative components of study design that influence results in downstream analyses.
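
For orientation, the conventional housekeeping-based scaling that such procedures build on can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure developed in this work, whose details the abstract does not give. The data layout, the function names `geometric_mean` and `housekeeping_scale`, and the gene labels `HK1`, `HK2` and `GENE_X` are all hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive counts."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def housekeeping_scale(counts, housekeeping):
    """Rescale every sample so that housekeeping geometric means agree.

    counts: dict mapping sample name -> dict of gene name -> raw count
            (hypothetical layout chosen for this sketch).
    housekeeping: gene names assumed to be stably expressed.
    """
    # Per-sample geometric mean over the housekeeping genes only.
    gm = {
        sample: geometric_mean([genes[g] for g in housekeeping])
        for sample, genes in counts.items()
    }
    # Common target level: arithmetic mean of the per-sample geometric means.
    target = sum(gm.values()) / len(gm)
    # Multiply all counts in a sample by its scaling factor target / gm.
    return {
        sample: {g: c * target / gm[sample] for g, c in genes.items()}
        for sample, genes in counts.items()
    }


# Toy data: sample_B has twice the overall signal of sample_A.
raw = {
    "sample_A": {"HK1": 100, "HK2": 400, "GENE_X": 50},
    "sample_B": {"HK1": 200, "HK2": 800, "GENE_X": 100},
}
normalized = housekeeping_scale(raw, ["HK1", "HK2"])
```

In this toy input the two samples differ only by a global factor, so scaling brings `GENE_X` to the same value in both. The sketch relies entirely on the housekeeping genes being stable, which is the assumption the abstract describes as unreliable in some cohorts.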
