Literature-based Bayesian analysis of gene expression data

Recent research has focused on incorporating biological function and pathway information into the analysis of gene expression data, partly to compensate for insufficient experimental replication, low signal-to-noise ratios, lack of reproducibility, and multiple-testing confounds. A Bayesian approach is well suited to this task because functional information can enter the analysis through the prior. In this study, we tested the feasibility of using literature-derived gene relationships in a Bayesian model to analyze gene expression data. Prior distributions were constructed from gene associations derived from the biomedical literature using Latent Semantic Indexing (LSI). The LSI model was built from more than 1 million Medline abstracts corresponding to 22,000 human and mouse genes. A key advantage of LSI is that both explicit and implicit gene relationships can be derived from the literature. Gene neighborhoods were modeled using latent Gaussian Markov random fields and a logistic transformation of the latent variables. We tested the procedure on a microarray dataset for interferon-stimulated genes in mouse embryonic fibroblasts. By integrating functional information from the literature, the Bayesian approach identified relevant genes that previously did not meet the 0.05 significance level. Compared with a standard mixture model, the spatial mixture model has more power to identify genes that are directly or indirectly regulated by interferon. The spatial model raised the ranks of genes known to be affected by interferon treatment, such as Nmi (N-myc and STAT interactor) and Ifi35 (interferon-induced protein 35). It also identified genes that had previously been ignored because of their marginal p-values, such as Dpysl2, Map2k1, Msn, Pcsk5, and Il6st. Interestingly, these genes appear to be indirectly related to interferon treatment. In summary, we show that our procedure increases statistical power and produces more biologically meaningful gene lists.
These results suggest that Bayesian methods that incorporate functional information from the literature may improve the analysis of gene expression data.
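
The pipeline summarized above can be illustrated with a minimal sketch. It is not the implementation used in the study. It assumes a toy term-by-gene matrix, a similarity threshold for defining neighborhoods, an intrinsic pairwise-difference form for the Gaussian Markov random field, and a two-component normal mixture for the gene-level z-scores. All function names, parameter values, and data are hypothetical, and the latent field is fixed by hand rather than estimated (for example by MCMC).

```python
import numpy as np

def lsi_gene_similarity(term_gene, rank):
    """Cosine similarity between genes in a rank-`rank` LSI space.

    term_gene: (n_terms, n_genes) weighted term-by-gene matrix (toy input).
    """
    # Truncated SVD: keep only the top `rank` singular triplets.
    _, s, vt = np.linalg.svd(term_gene, full_matrices=False)
    gene_vecs = (s[:rank, None] * vt[:rank]).T          # (n_genes, rank)
    norms = np.linalg.norm(gene_vecs, axis=1, keepdims=True)
    unit = gene_vecs / np.where(norms == 0.0, 1.0, norms)
    return unit @ unit.T

def neighborhood(similarity, threshold):
    """Boolean adjacency: genes are neighbors if similarity >= threshold."""
    adj = similarity >= threshold
    np.fill_diagonal(adj, False)                        # no self-neighbors
    return adj

def gmrf_log_density(latent, adj, tau=1.0):
    """Unnormalized log density of an intrinsic GMRF (assumed form):
    -tau/2 * sum over neighbor pairs of (x_i - x_j)^2."""
    diffs = latent[:, None] - latent[None, :]
    # Each undirected edge appears twice in adj, hence the extra factor 0.5.
    return -0.5 * tau * 0.5 * np.sum(adj * diffs ** 2)

def logistic(x):
    """Map latent variables to prior probabilities of differential expression."""
    return 1.0 / (1.0 + np.exp(-x))

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_de(z, prior, alt_mean=3.0, alt_sd=1.0):
    """Posterior probability of differential expression under a two-component
    mixture: null N(0, 1) versus a hypothetical alternative N(alt_mean, alt_sd^2)."""
    f1 = prior * normal_pdf(z, alt_mean, alt_sd)
    f0 = (1.0 - prior) * normal_pdf(z, 0.0, 1.0)
    return f1 / (f0 + f1)

# Toy data: genes 0 and 1 share terms, as do genes 2 and 3.
term_gene = np.array([[2, 1, 0, 0],
                      [1, 2, 0, 0],
                      [0, 0, 2, 4],
                      [0, 0, 4, 2]], dtype=float)
sim = lsi_gene_similarity(term_gene, rank=2)
adj = neighborhood(sim, threshold=0.5)

# A latent field that agrees within neighborhoods is favored by the GMRF prior
# over one that disagrees.
smooth = np.array([1.0, 1.0, -2.0, -2.0])
rough = np.array([1.0, -2.0, 1.0, -2.0])
print(gmrf_log_density(smooth, adj), gmrf_log_density(rough, adj))

# Genes 1 and 2 have the same marginal z-score but different literature-informed priors.
z = np.array([3.5, 1.8, 1.8, 0.2])
post = posterior_de(z, logistic(smooth))
print(np.round(post, 3))
```

In this toy setting, genes 1 and 2 have identical marginal z-scores, yet gene 1 receives a higher posterior probability because its literature neighbor (gene 0) pulls its latent variable, and hence its prior, upward. This mirrors the way the spatial model can promote genes with marginal p-values.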