Text mining biomedical literature for improving MEDLINE retrieval

A major problem in biomedical informatics is how best to present information retrieval results. This dissertation developed an approach that presents users with reduced sets of relevant citations together with topic labels. A text mining system was designed to group the retrieved citations, rank the citations in each cluster, and generate a set of keywords and MeSH terms that describe the common theme of each cluster. A series of follow-up studies was conducted to improve the system's performance. A spectral analysis clustering method, based on content similarity network techniques, was proposed for information retrieval systems. The new approach automatically organizes search results into categories. Our experimental results demonstrated that the presented method performs well in real-world applications. Automated concept recognition for each cluster is one of the important tasks in our text mining system. The system can perform keyword, key MeSH term, and key noun-phrase extraction. Within each cluster, the extraction of keywords and key MeSH terms is based on modeling the document-term matrix as a weighted bipartite graph, and a mutual reinforcement principle is used to rank the terms. Our new key noun-phrase extraction method is based on context-free grammar rules extracted from the input documents. An existing algorithm, Sequitur, is used to construct the context-free grammar rules, which re-represent a sequence as a hierarchical structure. Noun phrases are then extracted from these rules; the key noun phrases are identified from the most frequent rules, without extracting all of the grammar rules. The experimental results showed that our key noun-phrase extraction method is effective in identifying key concepts from documents and outperforms current widely used methods. We also explored ranking MEDLINE citations with an existing web ranking algorithm, HITS (Hyperlink-Induced Topic Search).
We further extended HITS to a supervised HITS algorithm for ranking citations. Our results showed that supervised HITS significantly outperforms HITS (p<0.01). Compared with HITS, supervised HITS improved citation ranking by between 15% and more than 59% in almost all the cases we tested. Furthermore, MeSH terms outperform text words in ranking citations, especially when HITS was applied (p<0.01).
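
As a rough illustration of the mutual reinforcement principle mentioned above, the following sketch ranks terms and documents on a weighted bipartite graph, using the same alternating update pattern as HITS. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function name mutual_reinforcement, the toy matrix, the unit-length normalization, and the fixed iteration count are all hypothetical, and the supervised HITS extension is not shown because the abstract does not describe how it works.

```python
from math import sqrt


def mutual_reinforcement(weights, iterations=50):
    """Rank documents and terms on a weighted bipartite graph.

    weights[i][j] is the (hypothetical) weight of term j in document i.
    Returns (doc_scores, term_scores), each scaled to unit length.
    """
    n_docs = len(weights)
    n_terms = len(weights[0])
    doc_scores = [1.0] * n_docs
    term_scores = [1.0] * n_terms

    def normalize(vector):
        # Scale to unit Euclidean length; guard against an all-zero vector.
        norm = sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    for _ in range(iterations):
        # A term scores highly if it occurs in highly scored documents.
        term_scores = normalize([
            sum(weights[i][j] * doc_scores[i] for i in range(n_docs))
            for j in range(n_terms)
        ])
        # A document scores highly if it contains highly scored terms.
        doc_scores = normalize([
            sum(weights[i][j] * term_scores[j] for j in range(n_terms))
            for i in range(n_docs)
        ])
    return doc_scores, term_scores


# Toy document-term matrix with made-up weights: 3 documents, 4 terms.
toy = [
    [2, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 0, 1, 1],
]
docs, terms = mutual_reinforcement(toy)
# Term indices ordered from highest to lowest score.
ranked_terms = sorted(range(len(terms)), key=lambda j: terms[j], reverse=True)
```

In this toy example, the term at index 1 ranks first because it carries the most weight in the document that shares terms with both of the others. In a real system the weights might come from a scheme such as TF-IDF, which is also an assumption here.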