Normalization Methods for the Analysis of Unbalanced Transcriptome Data: A Review

Dozens of normalization methods for correcting experimental variation and bias in high-throughput expression data have been developed over the last two decades. As many as 23 of them account for skewed expression between sample states, which outnumbers the conventional methods such as loess and quantile normalization. From the perspective of reference selection, we classify the normalization methods for skewed expression data into three categories: data-driven reference, foreign reference, and entire gene set. Following these reference selection strategies, we introduce and summarize each category of methods designed for gene expression data with a global shift between the compared conditions, covering both microarray and RNA-seq. To the best of our knowledge, this is the most comprehensive review of available preprocessing algorithms for unbalanced transcriptome data. The dissection and summary of these methods shed light on how preprocessing methods work and when each is appropriate to apply.
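The role of the reference can be illustrated with a minimal sketch. It is not any of the reviewed methods: the function names, the toy log2 values, and the choice of a median offset are all assumptions made for illustration. It compares a conventional correction that treats all endogenous genes as unchanged with a foreign-reference correction based on two hypothetical spike-in controls, on data where most genes are truly up-regulated.

```python
from statistics import median

def offset_from_reference(sample, baseline, ref_idx):
    """Median log2 difference between sample and baseline over the reference genes."""
    return median(sample[i] - baseline[i] for i in ref_idx)

def normalize(sample, baseline, ref_idx):
    """Subtract the reference-derived offset from every gene in the sample."""
    offset = offset_from_reference(sample, baseline, ref_idx)
    return [x - offset for x in sample]

# Toy log2 expression (assumed values): genes 0-5 are endogenous, 6-7 are spike-ins.
baseline = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 4.0, 12.0]
# The sample carries a technical offset of +1 on every gene,
# and genes 0-3 are truly up-regulated by 2 (a global shift).
sample = [8.0, 9.0, 10.0, 11.0, 10.0, 11.0, 5.0, 13.0]

endogenous = range(6)   # conventional assumption: most genes are unchanged
spike_ins = [6, 7]      # foreign reference: externally added controls

naive = normalize(sample, baseline, endogenous)
spiked = normalize(sample, baseline, spike_ins)

# Apparent log2 fold changes after each correction.
naive_fc = [s - b for s, b in zip(naive, baseline)]
spiked_fc = [s - b for s, b in zip(spiked, baseline)]
print("all-gene median:", naive_fc[:6])
print("spike-in median:", spiked_fc[:6])
```

In this toy case the all-gene median absorbs the true shift into the offset, so the up-regulated genes look unchanged and the unchanged genes look down-regulated, while the spike-in reference recovers only the technical offset. A data-driven reference would replace `spike_ins` with a gene subset selected from the data itself.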
