TEXTQUEST: Document Clustering of MEDLINE Abstracts for Concept Discovery in Molecular Biology

We present an algorithm for large-scale clustering of biological text drawn from MEDLINE abstracts. The algorithm combines a statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization. The method is flexible and robust, and is controlled by a small number of parameters. Experiments show that the resulting document clusters are meaningful, as assessed by their cluster-specific terms. Although the approach is statistical and involves minimal semantic analysis, these terms provide a shallow description of the document corpus and support concept discovery.
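The pipeline named in the abstract can be illustrated with a minimal sketch. It is not the TEXTQUEST implementation, and everything in it is an assumption made for illustration: the go-list is treated as a set of stemmed terms to keep, `stem` is a crude suffix stripper standing in for a real stemmer, term weights are a smoothed TF-IDF, and the unsupervised step is replaced by a simple threshold on cosine similarity followed by connected components. The `threshold` value and all function names are hypothetical, and graph layout optimization is not covered.

```python
import math
import re
from collections import Counter

# Hypothetical go-list: only these stemmed terms are kept as features.
GO_LIST = {"gene", "protein", "embryo", "cluster", "document", "term"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, restricted to the go-list."""
    bags = []
    for doc in docs:
        stems = (stem(w) for w in re.findall(r"[a-z]+", doc.lower()))
        bags.append(Counter(s for s in stems if s in GO_LIST))
    n = len(bags)
    # Document frequency: number of documents containing each term.
    df = Counter(term for bag in bags for term in bag)
    # Smoothed inverse document frequency, so that no weight is zero.
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in bag.items()}
        for bag in bags
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Group documents whose similarity graph is connected at the given threshold."""
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster` on a list of abstract strings returns lists of document indices, one list per cluster. The pairwise comparison is quadratic in the number of documents, so a sketch like this suits only small collections.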
