Batch correction of genomic data in chronic fatigue syndrome using CMA-ES

Modern genomic sequencing machines can measure thousands of probes from different specimens. Nevertheless, datasets that should in theory be comparable can show considerably different properties depending on both platform and specimen, a phenomenon known as batch effect. Batch correction aims to remove this effect from the data. One possible approach is to find a transformation function between different datasets, but optimizing the weights of such a function is not trivial: because there is no explicit gradient to follow, traditional gradient-based optimization techniques cannot be applied directly. In this work, we propose to use a state-of-the-art evolutionary algorithm, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), to optimize the weights of a transformation function for batch correction. The fitness function is driven by the classification accuracy of an ensemble of algorithms on the transformed data. The case study selected to test the proposed approach is mRNA gene expression data of Chronic Fatigue Syndrome (CFS), a disease for which there is currently no established diagnostic test. The transformation function obtained from three datasets, produced from different specimens, markedly improves the performance of classifiers on the task of diagnosing CFS. The presented results are an important stepping stone towards a reliable diagnostic test for this syndrome.
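The idea of evolving transformation weights against a classification-accuracy fitness can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method of this work: the data are synthetic, the transformation is a per-feature affine map, a single nearest-centroid classifier stands in for the ensemble, and a simplified (mu/mu, lambda) evolution strategy with isotropic Gaussian sampling and a fixed step-size decay stands in for CMA-ES (it does not adapt a covariance matrix). All function names are hypothetical.

```python
import random


def make_batch(rng, n_per_class, shift, scale):
    """Synthetic two-class, two-feature batch with a per-feature batch effect."""
    samples, labels = [], []
    for label, center in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            point = [rng.gauss(c, 0.5) for c in center]
            samples.append([s * p + d for p, s, d in zip(point, scale, shift)])
            labels.append(label)
    return samples, labels


def centroids(samples, labels):
    """Mean feature vector of each class."""
    out = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        out[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return out


def transform(weights, samples):
    """Per-feature affine map x' = a * x + b, with weights = [a..., b...]."""
    dim = len(weights) // 2
    a, b = weights[:dim], weights[dim:]
    return [[a[i] * x[i] + b[i] for i in range(dim)] for x in samples]


def fitness(weights, samples, labels, ref_centroids):
    """Accuracy of a nearest-centroid classifier on the transformed batch."""
    correct = 0
    for x, y in zip(transform(weights, samples), labels):
        pred = min(
            ref_centroids,
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, ref_centroids[c])),
        )
        correct += pred == y
    return correct / len(labels)


def evolve(fit, start, sigma=1.0, lam=20, mu=5, generations=40, seed=0):
    """Simplified (mu/mu, lambda) ES; NOT full CMA-ES (no covariance adaptation)."""
    rng = random.Random(seed)
    mean = list(start)
    best, best_fit = list(start), fit(start)
    for _ in range(generations):
        # Sample lambda candidates around the current mean.
        pop = [[m + sigma * rng.gauss(0, 1) for m in mean] for _ in range(lam)]
        scored = sorted(pop, key=fit, reverse=True)
        # Recombine the mu best candidates into the new mean.
        mean = [sum(col) / mu for col in zip(*scored[:mu])]
        top_fit = fit(scored[0])
        if top_fit > best_fit:
            best, best_fit = scored[0], top_fit
        sigma *= 0.95  # fixed decay instead of CMA-ES step-size control
    return best, best_fit


rng = random.Random(42)
ref_x, ref_y = make_batch(rng, 30, shift=(0.0, 0.0), scale=(1.0, 1.0))
new_x, new_y = make_batch(rng, 30, shift=(5.0, -4.0), scale=(2.0, 0.5))
ref_c = centroids(ref_x, ref_y)

identity = [1.0, 1.0, 0.0, 0.0]  # no correction applied
baseline = fitness(identity, new_x, new_y, ref_c)
best_w, best_acc = evolve(lambda w: fitness(w, new_x, new_y, ref_c), identity)
print("accuracy without correction:", baseline)
print("accuracy with evolved transformation:", best_acc)
```

Because accuracy is piecewise constant in the weights, it offers no useful gradient, which is why a sampling-based search is used here. In a real setting the fitness would have to be estimated on held-out data to avoid fitting the transformation to the evaluation samples.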
