Kernel density weighted loess normalization improves the performance of detection within asymmetrical data

Background: Normalization of gene expression data has been studied for many years, and many strategies have been formulated to deal with different types of data. Most normalization algorithms rely on the assumption that the number of up-regulated genes and the number of down-regulated genes are roughly the same. However, the well-known Golden Spike experiment presents a unique situation in which differentially regulated genes are biased toward one direction, thereby challenging the conclusions of previous benchmark studies.

Results: This study proposes two novel approaches, KDL and KDQ, based on kernel density estimation to improve upon the basic idea of invariant set selection. The key concept is to assign each data point on the MA plot an importance score according to its proximity to the cluster of null genes, under the assumption that null genes are more densely distributed than differentially regulated genes. The approaches are compared on the Golden Spike experiment as well as on simulated data, using ROC curves and compression rates. KDL and KDQ in combination with GCRMA provided the best performance among all approaches.

Conclusions: This study determined that methods based on invariant sets are better able to resolve the problem of asymmetry. Normalization, either before or after expression summary for probesets, improves performance to a similar degree.
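The idea of weighting points on the MA plot by their local density before fitting a loess curve can be illustrated with a minimal sketch. This is not the authors' KDL implementation: the function names `kernel_density_weights` and `weighted_loess`, the span of 0.4, the default bandwidth of SciPy's `gaussian_kde`, and the simulated data are all assumptions made for illustration, and a plain weighted local linear regression stands in for a full loess routine.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kernel_density_weights(m, a):
    """Score each MA-plot point by its estimated 2-D density, scaled to (0, 1]."""
    pts = np.vstack([a, m])            # shape (2, n), as gaussian_kde expects
    dens = gaussian_kde(pts)(pts)      # density evaluated at every data point
    return dens / dens.max()


def weighted_loess(a, m, w, span=0.4):
    """Local linear fit of M on A; tricube distance weights times prior weights w."""
    n = len(a)
    k = max(int(np.ceil(span * n)), 3)         # number of neighbours per local fit
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(a - a[i])
        h = max(np.partition(d, k - 1)[k - 1], 1e-12)   # distance to k-th neighbour
        tricube = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3
        sw = np.sqrt(tricube * w)              # square root of the combined weights
        X = np.column_stack([np.ones(n), a - a[i]])
        beta, *_ = np.linalg.lstsq(X * sw[:, None], m * sw, rcond=None)
        fitted[i] = beta[0]                    # intercept = fitted value at a[i]
    return fitted


# Simulated asymmetric data (an assumption, not the Golden Spike dataset):
# 80% null genes along a mild intensity-dependent trend, 20% up-regulated only.
rng = np.random.default_rng(0)
n = 400
a = rng.uniform(4, 14, n)
is_null = rng.random(n) >= 0.2
m = 0.3 + 0.05 * (a - 9) + rng.normal(0, 0.2, n)
m[~is_null] += rng.uniform(1, 3, (~is_null).sum())

w = kernel_density_weights(m, a)
fit_kd = weighted_loess(a, m, w)               # density-weighted curve
fit_plain = weighted_loess(a, m, np.ones(n))   # ordinary, unweighted curve

# Mean residual of the null genes after subtracting each curve; closer to 0 is better.
bias_kd = abs(np.mean((m - fit_kd)[is_null]))
bias_plain = abs(np.mean((m - fit_plain)[is_null]))
print(f"null-gene bias, density-weighted: {bias_kd:.3f}")
print(f"null-gene bias, unweighted:       {bias_plain:.3f}")
```

In this simulation the up-regulated points pull the unweighted curve upward, whereas the density weights let the dense null cluster dominate the fit. The normalized log-ratios would then be `m - fit_kd`.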
