Getting Started in Text Mining

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature. There are at least as many motivations for doing text mining work as there are types of bioscientists. Model organism database curators have been heavy participants in the development of the field because they need to process large numbers of publications in order to populate the many data fields for every gene in their species of interest. Bench scientists have built biomedical text mining applications to aid in the development of tools for interpreting the output of high-throughput assays and to improve searches of sequence databases (see [1] for a review). Bioscientists of every stripe have built applications to deal with two problems: the double-exponential growth in the scientific literature over the past few years, and the unique difficulties of searching PubMed/MEDLINE for genomics-related publications.

A surprising phenomenon can be noted in the recent history of biomedical text mining: although several systems have been built and deployed in the past few years (Chilibot, Textpresso, and PreBIND, for example; see Text S1 for these and most other citations), the ones that are seeing high usage rates and are making productive contributions to the working lives of bioscientists have been built not by text mining specialists, but by bioscientists. We speculate on why this might be so below.

Three basic types of approaches to text mining have been prevalent in the biomedical domain. Co-occurrence–based methods do no more than look for concepts that occur in the same unit of text (typically a sentence, but sometimes as large as an abstract) and posit a relationship between them. (See [2] for an early co-occurrence–based system.) For example, if such a system saw that BRCA1 and breast cancer occurred in the same sentence, it might assume a relationship between breast cancer and the BRCA1 gene.
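
The co-occurrence idea just described can be illustrated with a minimal Python sketch. It is not taken from any of the systems cited here: the toy lexicons, the naive sentence splitter, and all function names are assumptions made purely for illustration.

```python
import re
from itertools import product

# Hypothetical toy lexicons: concept label -> surface forms to look for.
GENES = {"BRCA1": ["BRCA1"]}
DISEASES = {"breast cancer": ["breast cancer"]}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, lexicon):
    """Return the set of concept labels with a surface form in the sentence."""
    found = set()
    for label, forms in lexicon.items():
        for form in forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, sentence, re.IGNORECASE):
                found.add(label)
                break
    return found


def cooccurrences(text):
    """Posit a (gene, disease) relation for every pair sharing a sentence."""
    pairs = []
    for sentence in split_sentences(text):
        genes = sorted(mentions(sentence, GENES))
        diseases = sorted(mentions(sentence, DISEASES))
        pairs.extend((g, d, sentence) for g, d in product(genes, diseases))
    return pairs


if __name__ == "__main__":
    example = "Mutations in BRCA1 are found in familial breast cancer."
    for gene, disease, evidence in cooccurrences(example):
        print(gene, "<->", disease, "|", evidence)
```

Note that this sketch would also posit a relationship for a sentence such as "BRCA1 was not linked to breast cancer here", since it looks only at which concepts share a sentence and not at what the sentence asserts about them.
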
Some early biomedical text mining systems were co-occurrence–based, but such systems are highly error prone and are not commonly built today. In fact, many text mining practitioners would not consider them to be text mining systems at all. Co-occurrence of concepts in a text is sometimes used as a simple baseline when evaluating more sophisticated systems. Even as a baseline, such a system is nontrivial to build, since it must deal with variability in the ways that concepts are expressed in human-produced texts. For example, BRCA1 could be referred to by any of its alternate symbols: IRIS, PSCP, BRCAI, BRCC1, or RNF53 (or by any of their many spelling variants, which include BRCA1, BRCA-1, and BRCA 1). It could also be referred to by any of the variants of its full name: breast cancer 1, early onset (its official name per Entrez Gene and the Human Gene Nomenclature Committee), breast cancer susceptibility gene 1, or the latter's variant breast cancer susceptibility gene-1. Similarly, breast cancer could be referred to as breast cancer, carcinoma of the breast, or mammary neoplasm. These variability issues challenge more sophisticated systems as well; we discuss ways of coping with them in Text S1.

Two more common (and more sophisticated) approaches to text mining exist: rule-based or knowledge-based approaches, and statistical or machine-learning–based approaches. The variety of types of rule-based systems is quite wide. In general, rule-based systems make use of some sort of knowledge. This might take the form of general knowledge about how language is structured, specific knowledge about how biologically relevant facts are stated in the biomedical literature, knowledge about the sets of things that bioscientists talk about, the kinds of relationships that they can have with one another, and the variant forms by which they might be mentioned in the literature, or any subset or combination of these.
(See [3] for an early rule-based system, and [4] for a discussion of rule-based approaches to various biomedical text mining tasks.) At one end of the spectrum, a simple rule-based system might use hard-coded patterns (for example, "plays a role in" or "is associated with") to find explicit statements about the classes of things in which the researcher is interested. At the other end of the spectrum, a rule-based system might use sophisticated linguistic and semantic analyses to recognize a wide range of possible ways of making assertions about those classes of things. It is worth noting that useful systems have been built using technologies at both ends of the spectrum, and at many points in between.

In contrast, statistical or machine-learning–based systems operate by building classifiers that may operate on any level, from labelling part of speech to choosing syntactic parse trees to classifying full sentences or documents. (See [5] for an early learning-based system, and [4] for a discussion of learning-based approaches to various biomedical text mining tasks.)

Rule-based and statistical systems each have their advantages and disadvantages. For example, rule systems are often assumed (not necessarily correctly) to take a significant amount of time to develop. Statistical systems typically require large amounts of expensive-to-get labelled training data. In practice, statistical and rule-based systems can be fruitfully combined. For example, a statistical system that classifies documents as to whether or not they are relevant to the subject of genetic variation in mouse genes might use the output of a rule-based mutation recognizer as one of its feature extractors. Many systems also employ an initial statistical processing step, followed by rule-based post-processing.

A primary problem that either type of system must deal with is the issue of ambiguity: the existence of multiple relationships between language and meanings or categories.
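
Before looking at ambiguity in more detail, the simple end of the rule-based spectrum described above can be sketched in a few lines of Python. The two trigger phrases come from the text. Everything else is an assumption made for illustration: the function names are hypothetical, and treating all text before and after the trigger as the two arguments is a deliberately crude simplification. The last function hints at how rule output might serve as one feature for a statistical classifier.

```python
import re

# Trigger phrases taken from the text; real systems would use many more.
TRIGGERS = ["plays a role in", "is associated with"]

# One compiled pattern per trigger: <left argument> <trigger> <right argument>.
# Assumption: the arguments are simply whatever precedes and follows the trigger.
PATTERNS = [
    (
        trigger,
        re.compile(
            r"^(.*?)\s+" + re.escape(trigger) + r"\s+(.*?)[.;]?$",
            re.IGNORECASE,
        ),
    )
    for trigger in TRIGGERS
]


def extract_assertions(sentence):
    """Return (left, trigger, right) triples for each pattern that matches."""
    results = []
    for trigger, pattern in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            results.append((match.group(1), trigger, match.group(2)))
    return results


def rule_features(sentence):
    """Hypothetical feature extractor: expose rule output to a classifier."""
    return {"n_rule_hits": len(extract_assertions(sentence))}


if __name__ == "__main__":
    print(extract_assertions("BRCA1 is associated with breast cancer."))
    print(rule_features("BRCA1 is associated with breast cancer."))
```

A sketch like this finds only statements phrased exactly as the patterns expect, which is why it sits at the simple end of the spectrum.
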
Ambiguity exists at every level of linguistic structure, from the part of speech of words to subtle issues in pragmatics. A common example of ambiguity in genomics text is related to gene names and symbols. Consider the string fat: is it an adjective, or a noun? Either part of speech is entirely plausible in biomedical texts, and PubMed returns almost 112 K hits for that single-word query (and more than 13 K even if we try to restrict the query to genomics by including the disjunction (gene OR genetic OR genetics)). This ambiguity is relatively easy to resolve, but fat also turns out to be the name or symbol of a number of different genes: humans, mice, rats, Drosophila, zebrafish, chickens, M. mulatta, and two Lactobacilli each have at least one gene whose name, official symbol, or alias is fat. Even if the species whose gene is being referred to can be determined, the ambiguity may still not be resolved. In humans, fat is the official symbol of Entrez Gene entry 2195 and an alternate symbol for Entrez Gene entry 948. The distinction is not trivial. The former is a cadherin, and is associated with tumor suppression and with bipolar disorder, while the latter is a thrombospondin receptor associated with atherosclerosis, platelet glycoprotein deficiency, hyperlipidemia, and insulin resistance, to name just a few phenotypes. Such ambiguities matter: if your analysis is wrong, you miss or erroneously extract information on relations between molecular biology and human disease.
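
The residual ambiguity of fat in humans can be made concrete with a small Python sketch. The two Entrez Gene identifiers and their official/alternate status come from the text above. The lexicon layout, the function names, and the case-insensitive matching are assumptions for illustration only. Entries for the other species are omitted because the text does not give their identifiers.

```python
# Toy gene lexicon built only from the facts stated in the text.
# The record layout is a hypothetical choice for this sketch.
LEXICON = [
    {"entrez_id": 2195, "species": "human", "symbol": "fat", "status": "official"},
    {"entrez_id": 948, "species": "human", "symbol": "fat", "status": "alias"},
]


def candidates(symbol, species=None):
    """Return Entrez Gene IDs whose symbol or alias matches, optionally by species.

    Matching is case-insensitive, which is an assumption of this sketch.
    """
    wanted = symbol.lower()
    return [
        entry["entrez_id"]
        for entry in LEXICON
        if entry["symbol"].lower() == wanted
        and (species is None or entry["species"] == species)
    ]


def is_ambiguous(symbol, species=None):
    """True when more than one gene remains after filtering by species."""
    return len(candidates(symbol, species)) > 1


if __name__ == "__main__":
    # Even with the species fixed to human, two candidate genes remain.
    print(candidates("fat", "human"), is_ambiguous("fat", "human"))
```

Knowing the species narrows the candidate set but, as in this case, does not always reduce it to a single gene, so further context is needed to choose between the remaining entries.
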

[1] K. E. Ravikumar, et al. Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinform., 2005.

[2] K. Bretonnel Cohen, et al. Natural Language Processing and Systems Biology. Artificial Intelligence Methods and Tools for Systems Biology, 2004.

[3] Linda A. Watson, et al. Information Retrieval: A Health and Biomedical Perspective. 2005.

[4] Sophia Ananiadou, et al. Text Mining for Biology and Biomedicine. 2005.

[5] T. Jenssen, et al. A literature network of human genes for high-throughput analysis of gene expression. 2001.

[6] K. Cohen, et al. Biomedical language processing: what's beyond PubMed? Molecular Cell, 2006.

[7] Miguel A. Andrade-Navarro, et al. Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions. ISMB, 1999.

[8] Mark Craven, et al. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. ISMB, 1999.

[9] James H. Martin, et al. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2000.

[10] Virginia Teller. Review of Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky and James H. Martin (Prentice Hall, 2000). 2000.

[11] K. Bretonnel Cohen, et al. A Resource for Constructing Customized Test Suites for Molecular Biology Entity Identification Systems. 2004.