Novel and simple transformation algorithm for combining microarray data sets

Background

With microarray technology, variability in experimental conditions, such as RNA sources, microarray production, or the use of different platforms, can cause bias. Such systematic differences present a substantial obstacle to the analysis of microarray data and lead to inconsistent and unreliable information. Therefore, one of the most pressing challenges in the field of microarray technology is how to integrate results from different microarray experiments, or how to combine data sets before a specific analysis.

Results

Two microarray data sets based on a 17k cDNA microarray system were used, consisting of 82 normal colon mucosa and 72 colorectal cancer tissues. Each data set was prepared from either total RNA or amplified mRNA, and the effect of RNA source between the two data sets was detected with an ANOVA (analysis of variance) model. A simple integration method was introduced, based on the distributions of gene expression ratios among different microarray data sets. The method transformed gene expression ratios into the form of a reference data set on a gene-by-gene basis. Hierarchical clustering analysis, density and box plots, and mixture scores with correlation coefficients revealed that the two data sets were well intermingled, indicating that the proposed method minimized the experimental bias. In addition, no RNA source effect was detected after the proposed transformation was applied. In the mixed data set, the two previously identified subgroups, normal and tumor, were well separated, and the efficiency of integration was more prominent in tumor groups than in normal groups. The transformation method was slightly more effective when a data set with strong homogeneity within the same experimental group was used as the reference data set.

Conclusion

The proposed method is simple but useful for combining several data sets from different experimental conditions. With this method, biologically useful information can be detected by applying various analytic methods to the combined data set with its increased sample size.
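
The abstract does not spell out the transformation formula, so the following is only a minimal sketch of one plausible gene-by-gene mapping onto a reference data set. It assumes that each gene's log expression ratios in the target data set are standardized and then rescaled to that gene's mean and standard deviation in the reference data set. The function name `transform_to_reference` and the row-per-gene layout are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from statistics import mean, stdev


def transform_to_reference(target, reference):
    """Rescale each gene in `target` to the distribution of the same gene in `reference`.

    Both arguments are lists of rows, one row per gene, each row holding that
    gene's expression ratios across the samples of one data set. Rows must be
    in the same gene order, and each row needs at least two samples.
    ASSUMPTION: a mean/SD (location-scale) mapping stands in for the paper's
    distribution-based transformation, which the abstract does not specify.
    """
    if len(target) != len(reference):
        raise ValueError("target and reference must contain the same genes")

    transformed = []
    for t_row, r_row in zip(target, reference):
        t_mean, t_sd = mean(t_row), stdev(t_row)
        r_mean, r_sd = mean(r_row), stdev(r_row)
        if t_sd == 0:
            # A constant gene carries no spread to rescale; place it at the reference mean.
            transformed.append([r_mean] * len(t_row))
        else:
            # Standardize within the target set, then express in reference units.
            transformed.append([(x - t_mean) / t_sd * r_sd + r_mean for x in t_row])
    return transformed


# Toy example: two genes, three samples per data set (values are made up).
reference = [[0.0, 1.0, 2.0], [1.0, 1.5, 2.0]]
target = [[2.0, 4.0, 6.0], [0.0, 0.1, 0.2]]
combined_ready = transform_to_reference(target, reference)
```

After such a mapping, every gene in the transformed data set has the same mean and standard deviation as in the reference data set, so the two sets can be concatenated sample-wise before clustering. In practice one would probably apply the mapping separately within each experimental group (normal, tumor) so that the biological difference between the groups is not removed; that choice is also an assumption here.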
