Strategies for aggregating gene expression data: The collapseRows R function

Background

Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements corresponding to a single gene, representing the expression profiles of a co-expression module by a single expression profile, and aggregating cell-type marker information to de-convolute expression data. Several standard statistical summary techniques can be used, but network methods also provide useful alternative methods to find representatives. Currently, few collapsing functions have been developed and widely applied.

Results

We introduce the R function collapseRows that implements several collapsing methods and evaluate its performance in three applications. First, we study a crucial step of the meta-analysis of microarray data: the merging of independent gene expression data sets, which may have been measured on different platforms. Toward this end, we collapse multiple microarray probes for a single gene and then merge the data by gene identifier. We find that choosing the probe with the highest average expression leads to the best between-study consistency. Second, we study methods for summarizing the gene expression profiles of a co-expression module. Several gene co-expression network analysis applications show that the optimal collapsing strategy depends on the analysis goal. Third, we study aggregating the information of cell type marker genes when the aim is to predict the abundance of cell types in a tissue sample based on gene expression data ("expression deconvolution"). We apply different collapsing methods to predict cell type abundances in peripheral human blood and in mixtures of blood cell lines. Interestingly, the most accurate prediction method involves choosing the most highly connected "hub" marker gene. Finally, to facilitate biological interpretation of collapsed gene lists, we introduce the function userListEnrichment, which assesses the enrichment of gene lists for known brain and blood cell type markers, and for other published biological pathways.

Conclusions

The R function collapseRows implements several standard and network-based collapsing methods. In various genomic applications we provide evidence that both types of methods are robust and biologically relevant tools.
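To make two of the collapsing strategies named above concrete, the following is a minimal Python sketch, not the collapseRows implementation itself. It illustrates (a) keeping, for each gene, the probe with the highest mean expression and (b) keeping the most highly connected "hub" row, where connectivity is taken here to be the sum of absolute Pearson correlations with the other rows in the same group. The function names, the toy data, the connectivity definition, and the fallback to the highest-mean row for groups of fewer than three rows are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def group_rows(row_to_group):
    """Invert a row -> group mapping into group -> list of rows."""
    groups = defaultdict(list)
    for row, group in row_to_group.items():
        groups[group].append(row)
    return groups


def collapse_max_mean(expr, row_to_group):
    """For each group, keep the row with the highest mean across samples."""
    return {
        group: max(rows, key=lambda r: mean(expr[r]))
        for group, rows in group_rows(row_to_group).items()
    }


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def collapse_max_connectivity(expr, row_to_group):
    """For each group, keep the row with the largest summed |correlation|
    to the other rows in that group (an illustrative 'hub' definition).
    Groups with fewer than three rows fall back to the highest-mean row,
    because connectivity cannot separate one or two rows."""
    chosen = {}
    for group, rows in group_rows(row_to_group).items():
        if len(rows) < 3:
            chosen[group] = max(rows, key=lambda r: mean(expr[r]))
            continue

        def connectivity(r, rows=rows):
            return sum(abs(pearson(expr[r], expr[q])) for q in rows if q != r)

        chosen[group] = max(rows, key=connectivity)
    return chosen


# Hypothetical toy data: rows are probes, columns are four samples.
expr = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [5.0, 6.0, 7.0, 9.0],
    "p3": [2.0, 2.5, 2.0, 3.0],
}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}

# Hypothetical marker rows for a single cell type.
markers = {
    "m1": [1.0, 2.0, 3.0, 4.0],
    "m2": [1.0, 2.0, 3.0, 5.0],
    "m3": [4.0, 1.0, 3.0, 2.0],
}
marker_to_type = {"m1": "T_cell", "m2": "T_cell", "m3": "T_cell"}

print(collapse_max_mean(expr, probe_to_gene))
print(collapse_max_connectivity(markers, marker_to_type))
```

Once each data set has been collapsed to one row per gene, independent data sets can be merged by gene identifier, as described in the first application.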
