Text analysis of MEDLINE for discovering functional relationships among genes: evaluation of keyword extraction weighting schemes

One of the key challenges of microarray studies is to derive biological insights from gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the functional links among genes. However, the quality of the keyword lists significantly affects the clustering results. We compared two keyword weighting schemes: normalised z-score and term frequency-inverse document frequency (TFIDF). Two gene sets were tested to evaluate how effectively each weighting scheme extracts keywords for gene clustering. Using established measures of cluster quality, the clustering results produced from TFIDF-weighted keywords outperformed those produced from normalised z-score weighted keywords. The optimised algorithms should be useful for partitioning genes from microarray lists into functionally discrete clusters.
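
To make the comparison concrete, the two kinds of weighting can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: the abstract does not give the exact formulas, so the sketch assumes the textbook TFIDF form (raw term count times log of inverse document frequency) and a plain z-score of each term's relative frequency across genes, without the paper's normalisation step. The function names, the `gene_docs` input layout (one pooled token list per gene), and the toy data are all hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(gene_docs):
    """Assumed textbook TFIDF: count(term, gene) * log(N / genes containing term)."""
    n_genes = len(gene_docs)
    counts = {gene: Counter(tokens) for gene, tokens in gene_docs.items()}
    doc_freq = Counter()
    for gene_counts in counts.values():
        doc_freq.update(gene_counts.keys())  # each gene counts once per term
    return {
        gene: {t: c * math.log(n_genes / doc_freq[t]) for t, c in gene_counts.items()}
        for gene, gene_counts in counts.items()
    }


def zscore_weights(gene_docs):
    """Assumed plain z-score of a term's relative frequency across all genes."""
    n_genes = len(gene_docs)
    rel_freq = {}
    for gene, tokens in gene_docs.items():
        total = len(tokens) or 1  # avoid division by zero for empty token lists
        rel_freq[gene] = {t: c / total for t, c in Counter(tokens).items()}
    vocabulary = set().union(*rel_freq.values())
    weights = {gene: {} for gene in rel_freq}
    for term in vocabulary:
        values = [rel_freq[gene].get(term, 0.0) for gene in rel_freq]
        mean = sum(values) / n_genes
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n_genes)
        for gene in rel_freq:
            if term in rel_freq[gene]:
                # Terms with no variation across genes carry no signal.
                weights[gene][term] = (rel_freq[gene][term] - mean) / sd if sd > 0 else 0.0
    return weights


# Hypothetical toy input: tokens pooled from the abstracts linked to each gene.
toy_docs = {
    "geneA": ["kinase", "kinase", "cell"],
    "geneB": ["cell", "apoptosis"],
    "geneC": ["cell", "receptor"],
}
tfidf = tfidf_weights(toy_docs)
zscores = zscore_weights(toy_docs)
```

In this toy input, "cell" occurs for every gene, so its TFIDF weight is zero for all of them, while "kinase" is specific to geneA and receives a positive weight under both schemes. Ranking each gene's terms by weight and keeping the top entries would give the keyword lists that a clustering step then consumes.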
