Expert knowledge without the expert: integrated analysis of gene expression and literature to derive active functional contexts

MOTIVATION: The interpretation of expression data without appropriate expert knowledge is difficult and usually limited to exploratory data analysis, such as clustering and detecting differentially regulated genes. However, comparing experimental results against manually compiled knowledge resources might limit or bias the perspective on the data. Thus, manual analysis by experts is required to obtain confident predictions about the processes involved.

RESULTS: We present an algorithm that simultaneously derives interpretations of expression measurements and biological hypotheses from biomedical publications. It identifies active functional contexts ('concepts'), i.e. gene clusters that exhibit both significant gene expression and a coherent literature profile. No manual intervention by an expert to specify prior knowledge is required. The approach scales to realistic applications and does not rely on controlled vocabularies or pathway resources. We validated our algorithm by analyzing a current juvenile arthritis dataset. The algorithm identified a number of gene clusters and accompanying literature topics as an interpretation of the data; these coincide well with the phenotype and the biological processes known to be involved in the disease. We demonstrate that the generated clusters are both more sensitive and more specific than Gene Ontology categories detected on the same data. The method allows for in-depth investigation of subsets of genes and of the associated literature topics and publications.

AVAILABILITY: Supplementary data on clusters is available upon request.
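The notion of an active functional context, a gene cluster with both significant expression and a coherent literature profile, can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes, purely for illustration, that expression significance is measured as a mean absolute z-score, that literature coherence is the mean pairwise cosine similarity of per-gene term-weight profiles, and that both are compared against fixed thresholds. All names, thresholds, gene identifiers and values below are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def expression_score(cluster, z_scores):
    """Mean absolute expression z-score over the genes in the cluster."""
    return sum(abs(z_scores[g]) for g in cluster) / len(cluster)


def literature_coherence(cluster, term_profiles):
    """Mean pairwise cosine similarity of the genes' literature profiles."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a single gene has no pairwise coherence
    total = sum(cosine(term_profiles[a], term_profiles[b]) for a, b in pairs)
    return total / len(pairs)


def is_active_context(cluster, z_scores, term_profiles,
                      min_expr=2.0, min_coherence=0.5):
    """True if the cluster passes both (assumed) thresholds."""
    return (expression_score(cluster, z_scores) >= min_expr
            and literature_coherence(cluster, term_profiles) >= min_coherence)


# Hypothetical toy data: expression z-scores and term weights per gene.
z_scores = {"G1": 2.5, "G2": -3.0, "G3": 2.1, "G4": 0.2}
term_profiles = {
    "G1": {"inflammation": 1.0, "cytokine": 1.0},
    "G2": {"inflammation": 1.0, "cytokine": 1.0},
    "G3": {"ribosome": 1.0},
    "G4": {"inflammation": 1.0},
}
```

In this toy setting, the cluster `["G1", "G2"]` passes both criteria, `["G1", "G3"]` is strongly expressed but shares no literature terms, and `["G1", "G4"]` fails the expression threshold. Requiring both criteria at once is what distinguishes such a context from a purely expression-based cluster.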
