Feature-Rich Memory-Based Classification for Shallow NLP and Information Extraction

Memory-Based Learning (MBL) is based on the storage of all available training data and on similarity-based reasoning for handling new cases. By interpreting tasks such as POS tagging and shallow parsing as classification tasks, the advantages of MBL (implicit smoothing of sparse data, automatic integration and relevance weighting of information sources, handling of exceptional data) contribute to state-of-the-art accuracy. However, Hidden Markov Models (HMMs) typically achieve higher accuracy than MBL (and other machine learning approaches) for tasks such as POS tagging and chunking. In this paper, we investigate how the advantages of MBL, such as its potential to integrate various sources of information, come into play when we compare our approach to HMMs on two Information Extraction (IE) datasets: the well-known Seminar Announcement data set and a new German Curriculum Vitae data set.

1 Memory-Based Language Processing

Memory-Based Learning (MBL) is a supervised classification-based learning method. A vector of feature values (an instance) is associated with a class by a classifier that lazily extrapolates from the most similar set (nearest neighbors) selected from all stored training examples. This is in contrast to eager learning methods like decision tree learning [26], rule induction [9], or Inductive Logic Programming [7], which abstract a generalized structure from the training set beforehand (forgetting the examples themselves) and use that structure to derive a classification for a new instance. In MBL, a distance metric on the feature space defines what the nearest neighbors of an instance are. Metrics with feature weights based on information theory or other relevance statistics allow us to use rich representations of instances and their context, and to balance the influence of diverse information sources in computing distance.
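As an illustration of this kind of weighted-overlap nearest-neighbor extrapolation, the following Python sketch computes an information-gain weight per feature and classifies a new instance with a weighted overlap metric. This is our own minimal illustration, not the authors' implementation; the function names (information_gain, classify) and the toy instances are assumptions made for the example.

import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(instances, labels, feature_index):
    """Entropy reduction obtained by splitting on one symbolic feature."""
    base = entropy(labels)
    by_value = defaultdict(list)
    for inst, lab in zip(instances, labels):
        by_value[inst[feature_index]].append(lab)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

def classify(train_instances, train_labels, new_instance, k=1):
    """Weighted-overlap k-NN: more informative features contribute
    more to the distance between instances."""
    n_feats = len(new_instance)
    weights = [information_gain(train_instances, train_labels, i)
               for i in range(n_feats)]
    def distance(a, b):
        return sum(w for w, x, y in zip(weights, a, b) if x != y)
    neighbours = sorted(zip(train_instances, train_labels),
                        key=lambda pair: distance(pair[0], new_instance))[:k]
    return Counter(lab for _, lab in neighbours).most_common(1)[0][0]

# Toy POS-tagging instances (invented): (previous tag, word, next word) -> tag
train_X = [("DT", "dollar", "later"), ("DT", "yen", "although"),
           ("RB", "rebounded", ","), ("RB", "higher", "against")]
train_y = ["NN", "NNS", "VBD", "RBR"]
print(classify(train_X, train_y, ("DT", "mark", ".")))  # -> 'NN', nearest neighbour shares the DT context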
Natural Language Processing (NLP) tasks typically concern the mapping of an input representation (e.g., a series of words) into an output representation (e.g., the POS tags corresponding to each word in the input). Most NLP tasks can therefore easily be interpreted as sequences of classification tasks: e.g., given a word and some representation of its context, decide what tag to assign to that word. By creating a separate classification instance (a "moving window" approach) for each word and its context, shallow syntactic or semantic structures can be produced for whole sentences or texts. In this paper, we argue that more semantic and complex input-output mappings, such as Information Extraction, can also effectively be modeled in such a memory-based, classification-oriented framework, and that this approach has a number of very interesting advantages over rivalling methods, most notably that each classification decision can be made dependent on a very rich and diverse set of features.

The properties of MBL as a lazy, similarity-based learning method seem to make a good fit with the properties of typical disambiguation problems in NLP:

• Similar input representations lead to similar output. E.g., words occurring in a similar context in general have the same POS tag. Similarity-based reasoning is the core of MBL.
• Many sub-generalizations and exceptions. By keeping all training instances in memory, exceptions included, an MBL approach can also capture generalizations from exceptional or low-frequency cases [12].
• Need for integration of diverse types of information. E.g., in Information Extraction, lexical features, spelling features, syntactic as well as phrasal context features, global text structure, and layout features can all potentially be very relevant.
• Automatic smoothing in very rich event spaces. Supervised learning of NLP tasks regularly runs into problems of sparse data: not enough training data is available to estimate reliable parameters for complex models. MBL incorporates an implicit, robust form of smoothing by similarity [33].

In the remainder of this section, we show how a memory-, similarity-, and classification-based approach can be applied to shallow syntactic parsing and can lead to state-of-the-art accuracy. Most of the tasks discussed here can also easily be modeled using Hidden Markov Models (HMMs), often with surprising accuracy. We discuss the strengths of HMMs and draw a comparison between the classification-based MBL method and the sequence-optimizing HMM approach (Section 1.2).

1.1 Memory-Based Shallow Parsing

Shallow parsing is an important component of most text analysis systems in Text Mining applications such as information extraction, summary generation, and question answering. It includes discovering the main constituents of sentences (NPs, VPs, PPs) and their heads, and determining syntactic relationships, such as subject, object, and adjunct relations, between verbs and the heads of other constituents. This is an important first step towards understanding the who, what, when, and where of sentences in a text.

In our approach to memory-based shallow parsing, we carve up the syntactic analysis process into a number of classification tasks with input vectors representing a focus item and a dynamically selected surrounding context. These classification tasks can be segmentation tasks (e.g., deciding whether a focus word or tag is the start or end of an NP) or disambiguation tasks (e.g., deciding whether a chunk is the subject NP, the object NP, or neither). The output of some memory-based modules is used as input by other memory-based modules (e.g., a tagger feeds a chunker, and the latter feeds a syntactic relation assignment module). Similar ideas about cascading processing steps have also been explored in other approaches to text analysis, e.g., finite-state partial parsing [1,18], statistical decision tree parsing [23], and maximum entropy parsing [30]. The approach briefly described here is explained and evaluated in more detail in [10,11,6].

Chunking

The phrase chunking task can be defined as a classification task by generalizing the approach of [28], who proposed to convert NP chunking into tagging each word with I for a word inside an NP, O for a word outside an NP, and B for the first word of an NP that immediately follows another NP. The decision on these so-called IOB tags for a word can be made by looking at the part-of-speech tag and the identity of the focus word and its local context. For the more general task of chunking other non-recursive phrases, we simply extend the tag set with IOB tags for each type of phrase. To illustrate this encoding with the extended IOB tag set, the bracketed sentence

But/CC [NP the/DT dollar/NN NP] [ADVP later/RB ADVP] [VP rebounded/VBD VP] ,/, [VP finishing/VBG VP] [ADJP slightly/RB higher/RBR ADJP] [Prep against/IN Prep] [NP the/DT yen/NNS NP] [ADJP although/IN ADJP] [ADJP slightly/RB lower/JJR ADJP] [Prep against/IN Prep] [NP the/DT mark/NN NP] ./.

is tagged as:

But/CC_O the/DT_I-NP dollar/NN_I-NP later/RB_I-ADVP rebounded/VBD_I-VP ,/,_O finishing/VBG_I-VP slightly/RB_I-ADVP higher/RBR_I-ADVP against/IN_I-Prep the/DT_I-NP yen/NNS_I-NP although/IN_I-ADJP slightly/RB_B-ADJP lower/JJR_I-ADJP against/IN_I-Prep the/DT_I-NP mark/NN_I-NP ./._O
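To make the encoding step concrete, the following sketch converts a chunk-bracketed sentence into per-word IOB tags using the variant described above, in which B- only marks a chunk-initial word that directly follows another chunk of the same type. The function name and the data layout are our own assumptions, not part of the original system.

def chunks_to_iob(chunked_sentence):
    """Convert a chunk-bracketed sentence into per-word IOB tags.

    Input: a list of (phrase_type, [(word, pos), ...]) chunks, where
    phrase_type is None for words outside any chunk.  B- only marks a
    chunk-initial word that directly follows a chunk of the same type.
    """
    tagged, prev_type = [], None
    for phrase_type, words in chunked_sentence:
        for i, (word, pos) in enumerate(words):
            if phrase_type is None:
                tag = "O"
            elif i == 0 and phrase_type == prev_type:
                tag = "B-" + phrase_type
            else:
                tag = "I-" + phrase_type
            tagged.append((word, pos, tag))
        prev_type = phrase_type
    return tagged

# Fragment of the example sentence above, in the assumed input layout.
sentence = [
    (None, [("But", "CC")]),
    ("NP", [("the", "DT"), ("dollar", "NN")]),
    ("ADVP", [("later", "RB")]),
    ("VP", [("rebounded", "VBD")]),
    (None, [(",", ",")]),
    ("ADJP", [("although", "IN")]),
    ("ADJP", [("slightly", "RB"), ("lower", "JJR")]),
]
for word, pos, tag in chunks_to_iob(sentence):
    print(f"{word}/{pos} {tag}")
# ... slightly/RB B-ADJP  (chunk-initial word directly after another ADJP)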
Table 1 (from [6]) shows the accuracy of this memory-based chunking approach when training and testing on Wall Street Journal material. We report Precision, Recall, and Fβ=1 scores, where Fβ is a weighted harmonic mean of Recall and Precision: Fβ = ((β² + 1) · P · R) / (β² · P + R). (An online demonstration of the Memory-Based Shallow Parser can be found at http://ilk.kub.nl.)

Table 1. Results of chunking-labeling experiments. Reproduced from [6].

  type         precision  recall  Fβ=1
  NP chunks    92.5       92.2    92.3
  VP chunks    91.9       91.7    91.8
  ADJP chunks  68.4       65.0    66.7
  ADVP chunks  78.0       77.9    77.9
  Prep chunks  95.5       96.7    96.1
  PP chunks    91.9       92.2    92.0
  ADVFUNCs     78.0       69.5    73.5

Grammatical Relation Finding

After POS tagging, phrase chunking, and labeling, the last step of shallow parsing consists of resolving the (types of) attachment between labeled phrases. This is done by using a classifier to assign a grammatical relation (GR) between pairs of words in a sentence. In our approach, one of these words is always a verb, since this yields the most important GRs. The other word (the focus) is the head of the phrase that is annotated with this grammatical relation in the treebank (e.g., a noun as the head of an NP).

An instance for such a pair of words is constructed by extracting a set of feature values from the sentence. The instance contains information about the verb and the focus: a feature for the word form and a feature for the POS of both. It also has similar features for the local context of the focus. Experiments on the training data suggest an optimal context width of two words to the left and one to the right. In addition to the lexical and the local context information, superficial information about clause structure is included: the distance from the verb to the focus, counted in words. A negative distance means that the focus is to the left of the verb. Other features contain the number of other verbs between the verb and the focus, and the number of intervening commas. These features were chosen by manual "feature engineering" (a sketch of this instance construction is given at the end of this subsection).

Table 2 shows some of the feature-value instances corresponding to the following sentence (POS tags after the slash, chunks denoted with square and curly brackets, and adverbial functions after the dash):

[ADVP Not/RB surprisingly/RB ADVP] ,/, [NP Peter/NNP Miller/NNP NP] ,/, [NP who/WP NP] [VP organized/VBD VP] [NP the/DT conference/NN NP] {PP-LOC [Prep in/IN Prep] [NP New/NNP York/NNP NP] PP-LOC} ,/, [VP does/VBZ not/RB want/VB to/TO come/VB VP] {PP-DIR [Prep to/IN Prep] [NP Paris/NNP NP] PP-DIR} [Prep without/IN Prep] [VP bringing/VBG VP] [NP his/PRP$ wife/NN NP].

Table 3 shows the results of the experiments. In the first row, only POS tag features are used. The other rows show the results of adding several types of chunk information.
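As referenced above, here is a minimal sketch of how such verb-focus instances could be assembled from a POS-tagged sentence. It is our own illustration, not the authors' exact feature extractor; the function name gr_instance, the padding symbol "_", and the dictionary layout are assumptions.

def gr_instance(tagged_sentence, verb_idx, focus_idx, left=2, right=1):
    """Build one verb/focus feature vector for grammatical-relation
    finding, roughly following the features described above: word and
    POS of verb and focus, a small context window around the focus,
    the word distance from verb to focus (negative = focus precedes
    the verb), and counts of intervening verbs and commas."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]

    def window(idx, offset):
        j = idx + offset
        return (words[j], tags[j]) if 0 <= j < len(words) else ("_", "_")

    lo, hi = sorted((verb_idx, focus_idx))
    between = tags[lo + 1:hi]
    features = {
        "verb": words[verb_idx], "verb_pos": tags[verb_idx],
        "focus": words[focus_idx], "focus_pos": tags[focus_idx],
        "distance": focus_idx - verb_idx,
        "intervening_verbs": sum(t.startswith("VB") for t in between),
        "intervening_commas": between.count(","),
    }
    for offset in range(-left, right + 1):
        if offset != 0:
            w, t = window(focus_idx, offset)
            features[f"ctx{offset}"] = w
            features[f"ctx{offset}_pos"] = t
    return features

# Fragment of the example sentence above; focus "Miller" (index 1)
# paired with the verb "organized" (index 4).
tagged = [("Peter", "NNP"), ("Miller", "NNP"), (",", ","), ("who", "WP"),
          ("organized", "VBD"), ("the", "DT"), ("conference", "NN")]
print(gr_instance(tagged, verb_idx=4, focus_idx=1))  # distance = -3, one intervening comma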

[1] Walter Daelemans et al. Memory-Based Learning: Using Similarity for Smoothing. ACL, 1997.
[2] Thorsten Brants et al. TnT – A Statistical Part-of-Speech Tagger. ANLP, 2000.
[3] Ellen Riloff et al. Automatically Constructing a Dictionary for Information Extraction Tasks. AAAI, 1993.
[4] Andrew McCallum et al. Information Extraction with HMM Structures Learned by Stochastic Optimization. AAAI/IAAI, 2000.
[5] Walter Daelemans et al. MBT: A Memory-Based Part of Speech Tagger-Generator. VLC@COLING, 1996.
[6] Walter Daelemans et al. Forgetting Exceptions is Harmful in Language Learning. Machine Learning, 1998.
[7] Andrew McCallum et al. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML, 2000.
[8] Andrew McCallum et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML, 2001.
[9] David Fisher et al. CRYSTAL: Inducing a Conceptual Dictionary. IJCAI, 1995.
[10] Walter Daelemans et al. Improving Accuracy in Word Class Tagging through the Combination of Machine Learning Systems. CL, 2001.
[11] Walter Daelemans et al. Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers. LREC, 2000.
[12] Mitchell P. Marcus et al. Text Chunking using Transformation-Based Learning. VLC@ACL, 1995.
[13] Richard M. Schwartz et al. Nymble: a High-Performance Learning Name-finder. ANLP, 1997.
[14] Dayne Freitag et al. Machine Learning for Information Extraction in Informal Domains. Machine Learning, 2000.
[15] Scott B. Huffman et al. Learning information extraction patterns from examples. Learning for Natural Language Processing, 1995.
[16] David M. Magerman. Natural Language Parsing as Statistical Pattern Recognition. ArXiv, 1994.
[17] Dan Roth et al. The Use of Classifiers in Sequential Inference. NIPS, 2001.
[18] Walter Daelemans et al. Cascaded Grammatical Relation Assignment. EMNLP, 1999.
[19] Steven Abney et al. Part-of-Speech Tagging and Partial Parsing, 1997.
[20] Adam L. Berger et al. A Maximum Entropy Approach to Natural Language Processing. CL, 1996.
[21] Steven J. DeRose et al. Grammatical Category Disambiguation by Statistical Optimization. CL, 1988.
[22] Eric Brill et al. Some Advances in Transformation-Based Part of Speech Tagging. AAAI, 1994.
[23] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. ANLP, 1988.
[24] Raymond J. Mooney et al. Relational Learning of Pattern-Match Rules for Information Extraction. CoNLL, 1999.
[25] J. Ross Quinlan et al. C4.5: Programs for Machine Learning, 1992.
[26] Erik F. Tjong Kim Sang et al. Memory-Based Shallow Parsing. J. Mach. Learn. Res., 2002.
[27] Christopher D. Manning et al. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. EMNLP, 2000.
[28] Yuji Matsumoto et al. Chunking with Support Vector Machines. NAACL, 2001.
[29] Lawrence R. Rabiner et al. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
[30] Gregory Grefenstette. Light parsing as finite state filtering, 1999.
[31] Adwait Ratnaparkhi et al. A Linear Observed Time Statistical Parser Based on Maximum Entropy Models. EMNLP, 1997.
[32] William W. Cohen. Fast Effective Rule Induction. ICML, 1995.