Incremental Information Extraction Using Tree-Based Context Representations

The purpose of information extraction (IE) is to find desired pieces of information in natural language texts and store them in a form suitable for automatic processing. Providing annotated training data to adapt a trainable IE system to a new domain requires a considerable amount of work. To address this, we explore incremental learning: training documents are annotated sequentially by a user and immediately incorporated into the extraction model. The system can thus support the user by proposing extractions based on the current extraction model, reducing the user's workload over time. We introduce an approach to modeling IE as a token classification task that allows incremental training. To provide sufficient information to the token classifiers, we use rich, tree-based context representations of each token as feature vectors. These representations draw on the heuristically deduced document structure in addition to linguistic and semantic information. We consider the resulting feature vectors as ordered and combine proximate features into more expressive joint features, called "Orthogonal Sparse Bigrams" (OSB). Our results indicate that this setup makes it possible to employ IE incrementally without a serious performance penalty.
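
Below is a minimal sketch of how such a feature combination could look, assuming the OSB scheme described in the Winnow-based spam-filtering work of Siefkes et al.: within a sliding window over the ordered feature sequence, each feature is paired with each of its predecessors, and the number of skipped positions is encoded so that pairs at different distances remain distinct. The feature names, the window size, and the `osb_features` helper are illustrative assumptions, not taken from the paper.

```python
def osb_features(features, window=4):
    """Combine an ordered feature sequence into Orthogonal Sparse Bigrams.

    For each feature, form a joint feature with each of the preceding
    features inside the sliding window; skipped positions are marked so
    that pairs spanning different distances stay distinguishable.
    """
    joint = []
    for i, current in enumerate(features):
        start = max(0, i - window + 1)
        for j in range(start, i):
            gap = i - j - 1  # number of features skipped between the pair
            joint.append(f"{features[j]} {'<skip> ' * gap}{current}".strip())
    return joint


# Example: a hypothetical ordered context representation of one token
print(osb_features(["POS=NN", "parent=TITLE", "prevPOS=JJ", "word=price"],
                   window=3))
# ['POS=NN parent=TITLE',
#  'POS=NN <skip> prevPOS=JJ', 'parent=TITLE prevPOS=JJ',
#  'parent=TITLE <skip> word=price', 'prevPOS=JJ word=price']
```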
