ABSTRACT OF THE THESIS

Information Extraction Using Hidden Markov Models

by

Timothy Robert Leek
Master of Science in Computer Science
University of California, San Diego, 1997
Professor Charles Peter Elkan, Chair

This thesis shows how to design and tune a hidden Markov model to extract factual information from a corpus of machine-readable English prose. In particular, the thesis presents an HMM that classifies and parses natural language assertions about genes being located at particular positions on chromosomes. The facts extracted by this HMM can be inserted into biological databases. The HMM is trained on a small set of sentence fragments chosen from the collected scientific abstracts in the OMIM (On-Line Mendelian Inheritance in Man) database and judged to contain the target binary relationship between gene names and gene locations. Given a novel sentence, all contiguous fragments are ranked by log-odds score, i.e. the log of the ratio of the probability of the fragment according to the target HMM to that according to a "null" HMM trained on all OMIM sentences. The most probable path through the HMM gives bindings for the annotations with precision as high as 80%. In contrast with traditional natural language processing methods, this stochastic approach makes no use of either part-of-speech taggers or dictionaries, instead employing non-emitting states to assemble modules roughly corresponding to noun, verb, and prepositional phrases. Algorithms for reestimating parameters for HMMs with non-emitting states are presented in detail. The ability to tolerate new words and recognize a wide variety of syntactic forms arises from the judicious use of "gap" states.

Chapter I

Good Facts Are Hard to Find

Finding facts in English prose is a task that humans are good at and computers are bad at. However, humans cannot stand to spend more than a few minutes at a time occupied with such drudgery. In this respect, finding facts is unlike a host of other jobs computers are currently hopeless at, like telling a joke, riding a bike, and cooking dinner. While there is no pressing need for computers to be good at those things, it is already of paramount importance that computers be proficient at finding information with precision in the proliferating archives of electronic text available on the Internet and elsewhere.

The state of the art in information retrieval technology is of limited use in this application. Standard boolean searching, vector-based approaches, and latent semantic indexing are geared more toward open-ended exploration than toward the targeted, detailed subsentence processing necessary for the fact-finding or information extraction task. Since these approaches discard syntax, a large class of targets, in which the relationships between groups of words are important, must be fundamentally beyond them. The critical noun and verb groups of a fact can only be found by doing some kind of parsing.

Information extraction is in most cases what people really want to do when they first set about searching text, i.e. before they lower their sights to correspond to available tools. But this does not mean that nothing less than full-blown NLP (natural language processing) will satisfy. There are many real-world text searching tasks that absolutely require syntactic information and yet are restricted enough to be tractable. An historian might want to locate passages in the Virginia colony records mentioning the "event" of a slave running away. Used in an unconstrained search, the words slave, run, and away, all very common, together with their various synonyms, would return much dross. To find this fact with precision we need to place constraints upon the arrangement of the words in the sentence; we need to limit the search with syntax. For instance, one might require that when two groups of words corresponding to slave and run appear in a sentence, the slave is in fact the one doing the running.
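As a concrete illustration of limiting a search with syntax, here is a minimal sketch, not taken from the thesis: the sentences and keywords are invented, and the regular expression is only a crude stand-in for a real syntactic test.

```python
import re

# Toy sentences; both contain all three keywords.
hit = "The slave Titus ran away from the plantation in April."
miss = "The overseer ran away before the slave revolt began."

def keyword_match(sentence):
    # Unconstrained search: fires whenever the keywords co-occur at all.
    return {"slave", "ran", "away"}.issubset(sentence.lower().split())

def constrained_match(sentence):
    # Crude word-order constraint: require "slave" to precede "ran away",
    # approximating the requirement that the slave is doing the running.
    return re.search(r"\bslave\b.*\bran away\b", sentence.lower()) is not None

for s in (hit, miss):
    print(keyword_match(s), constrained_match(s), "-", s)
# keyword_match fires on both sentences; constrained_match only on the first.
```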
Similar examples of what we call fact searching are commonplace in most domains. A market analyst might want to scan the Wall Street Journal and pick out all mentions of corporate management changes. And a geneticist would be thrilled to be able to tease out of scientific abstracts facts mapping genes to specific locations on chromosomes.

Historically, the field of information extraction has employed discrete manipulations in order to process sentences into the critical noun and verb groups. An incoming sentence is tagged for part-of-speech and then handed off to a scaled-down parser or DFA (deterministic finite automaton) which uses local syntax to decide if the elements of a fact are present and to divide the sentence up into logical elements. Recent advances in statistical natural language processing have been applied to this problem, but typically only in an ancillary role, e.g. in constructing dictionaries [17] and tagging words for part-of-speech [4]. The main processing engine remains combinatorial in flavor. Systems like FASTUS [8] and CIRCUS [14] do surprisingly well, considering the difficulty of the task, achieving precision and recall of better than 80%. But they require hand-built grammars or dictionaries of extraction patterns in order to attain this level of performance. A notable exception is the LIEP [9] system, which learns to generalize extraction patterns from training examples.

We have chosen to pursue a unified stochastic approach to the information extraction task, modeling sentence fragments containing the target fact with a hidden Markov model (HMM) which we use both to decide if a candidate sentence fragment contains the fact and to identify the important elements or slot fillers in the fact. An HMM trained to recognize a small set of representative sentence fragments differs radically from a DFA or discrete pattern matcher designed for the same task in that it outputs a probability. Unlike a DFA, an HMM will accept any sequence of words with non-zero probability. The probability it computes (after some corrections for sentence length and background frequencies of words) varies gracefully between the extremes, assigning very low probability to sequences that tend not to contain the fact and high probability to ones that tend to contain it. There is no need, if we use an HMM to find and process facts, to employ heuristics to rank and choose between competing explanations for a sentence, as symbolic approaches often do [9]. The probability the HMM computes is meaningful information we can use directly to reason about candidate facts in principled ways that submit to analysis. The HMM is a very compact and flexible representation for the information extraction task, and it seems to be less reliant upon human engineering and prior knowledge than non-probabilistic approaches. This thesis will discuss our efforts to construct a model for a binary relationship between gene names and gene locations, as found in a variety of syntactic forms in scientific abstracts.
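To make the scoring scheme concrete, the following is a minimal sketch of the log-odds ranking over contiguous fragments described in the abstract. The target_hmm and null_hmm objects and their log_prob method are assumptions standing in for a forward-algorithm implementation; none of this code is from the thesis.

```python
def log_odds(fragment, target_hmm, null_hmm):
    # log [ P(fragment | target) / P(fragment | null) ]; each probability
    # would come from the forward algorithm, summing over all state paths.
    # log_prob is a hypothetical method of the assumed HMM objects.
    return target_hmm.log_prob(fragment) - null_hmm.log_prob(fragment)

def best_fragment(words, target_hmm, null_hmm):
    # Score every contiguous fragment of the sentence and keep the best.
    best_score, best = float("-inf"), None
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            score = log_odds(words[i:j], target_hmm, null_hmm)
            if score > best_score:
                best_score, best = score, words[i:j]
    return best_score, best
```

Dividing out the null model in this way is what corrects for fragment length and for the background frequencies of common words: a fragment scores highly only if the target model explains it better than the corpus at large does.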
The model is structured hierarchically: at the top level, states are collected into modules corresponding to noun or verb groups, whereas at the bottom level, in some cases, states function entirely deterministically, employing DFAs to recognize commonly occurring patterns. The HMM consists of only 64 states with an average of 3 transitions each, and explicitly mentions fewer than 150 words. When deploying the model to find facts in novel sentences, no attempt is made to tag for part-of-speech. "Gap" states, which assign emission probability according to word frequency in the entire corpus, permit the HMM to recognize disconnected segments of a fact and to tolerate new words. Unknown words, if they appear in the right local context, are accepted by the HMM essentially without penalty. So while the list of words likely to participate in forming a gene name or gene location is long and populated by words both common and rare in the corpus, our approach is competent at correctly identifying even unknown words, as long as they appear flanked by other words that serve to index the fact well. The accuracy of this HMM approach to information extraction, in the context of the gene name-location fact, is on par with symbolic approaches.

This thesis is organized as follows. We begin with a description of the gene name-location information extraction task. Next, we present the modular HMM architecture constructed for this task, motivating our choice of null or background model and demonstrating the discriminatory power it adds to this approach. A brief technical discussion comes next, of the precise formulae used to reestimate parameters for an HMM with non-emitting states. Then we provide implementation and optimization details, followed by training and testing performance. We conclude with some remarks on the use of prior knowledge and ideas for future work.

Chapter II

Automatic Annotation Generation

We consider the question of finding facts in unrestricted prose in the context of filling in slots in a database of facts about genes. The slots in the database correspond to biological entities. These are described by single words or simple phrases, three examples of which might be the name of a gene, some specification of its location, and some list of diseases in which it is known to be involved. An example pair of acceptable entries is

SLOT            ENTRY
Gene Name:      (The gene encoding BARK2)
Gene Location:  (mouse chromosome 5)

which we might find buried in a sentence like

The gene encoding BARK2 mapped to mouse chromosome 5, whereas that encoding BARK1 was localized to mouse chromosome 19.

This is valuable information that is available nowhere except in the published literature. Specialized databases like SwissProt and GenBank do not contain these kinds of associations, so there is interest in developing automated systems for filling in these slots. In order to populate these slots, we must locate and correctly analyze binary (or perhaps even ternary and higher) relations between likely elements.
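For the example sentence above, the extracted slot fillers might be rendered as simple records like the following. This is a hypothetical illustration of the database entries, with field names invented for the sketch, not code from the thesis.

```python
# Hypothetical slot-filler records for the example sentence:
# "The gene encoding BARK2 mapped to mouse chromosome 5, whereas that
#  encoding BARK1 was localized to mouse chromosome 19."
facts = [
    {"gene_name": "The gene encoding BARK2",
     "gene_location": "mouse chromosome 5"},
    {"gene_name": "that encoding BARK1",
     "gene_location": "mouse chromosome 19"},
]
```

Populating a biological database then amounts to inserting such records, once the HMM has ranked candidate fragments and the most probable path has bound the slot fillers.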
References

[1] Scott B. Huffman, et al. Learning information extraction patterns from examples. Learning for Natural Language Processing, 1995.
[2] Robert L. Mercer, et al. Class-based n-gram models of natural language. Computational Linguistics, 1992.
[3] Eric Brill, et al. Some advances in transformation-based part of speech tagging. AAAI, 1994.
[4] Anders Krogh, et al. SAM: sequence alignment and modeling software system. 1995.
[5] L. Baum, et al. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. 1967.
[6] Richard Hughey, et al. Scoring hidden Markov models. Computer Applications in the Biosciences, 1997.
[7] D. Haussler, et al. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology, 1993.
[8] John Cocke, et al. A statistical approach to machine translation. Computational Linguistics, 1990.
[9] Douglas E. Appelt, et al. FASTUS: a system for extracting information from natural-language text. 1992.
[10] Richard M. Schwartz, et al. Nymble: a high-performance learning name-finder. ANLP, 1997.
[11] Wendy G. Lehnert, et al. Wrap-Up: a trainable discourse module for information extraction. Journal of Artificial Intelligence Research, 1994.
[12] Anders Krogh, et al. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Computer Applications in the Biosciences, 1996.
[13] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989.
[14] Jerry R. Hobbs. The generic information extraction system. MUC, 1993.
[15] Naftali Tishby, et al. Distributional clustering of English words. ACL, 1993.
[16] Ellen Riloff, et al. Automatically constructing a dictionary for information extraction tasks. AAAI, 1993.
[17] John Lafferty, et al. Grammatical trigrams: a probabilistic model of link grammar. 1992.