A Corpus-Based Bootstrapping Algorithm for Semi-Automated Semantic Lexicon Construction

Many applications need a lexicon that represents semantic information, but acquiring lexical information is time-consuming. We present a corpus-based bootstrapping algorithm that helps users create domain-specific semantic lexicons quickly. The algorithm takes as input a representative text corpus for the domain and a small set of ‘seed words’ that belong to a semantic class of interest. It hypothesizes new words that are also likely to belong to the semantic class because they occur in the same contexts as the seed words. The best hypotheses are added to the seed word list dynamically, and the process iterates in a bootstrapping fashion. When the bootstrapping process halts, a ranked list of hypothesized category words is presented to a user for review. We used this algorithm to generate a semantic lexicon for eleven semantic classes associated with the MUC-4 terrorism domain.
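
The bootstrapping loop described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not specify how contexts are defined, how hypotheses are scored, or when the process halts. The sketch assumes that a context is the pair of immediately adjacent tokens, that a candidate's score is the number of distinct contexts it shares with the current category words, and that the loop stops after a fixed number of rounds. The names `contexts_by_word`, `bootstrap`, `iterations`, and `add_per_round` are hypothetical.

```python
from collections import defaultdict


def contexts_by_word(sentences):
    """Map each word to the set of (left, right) neighbor pairs it occurs in.

    Assumption: a "context" is just the adjacent tokens; the paper may
    define contexts differently.
    """
    ctx = defaultdict(set)
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            ctx[padded[i]].add((padded[i - 1], padded[i + 1]))
    return ctx


def bootstrap(sentences, seeds, iterations=2, add_per_round=1):
    """Grow a seed list and return hypothesized words in acceptance order.

    Assumed scoring: number of distinct contexts a candidate shares with
    the current category words. Assumed halting rule: a fixed number of
    rounds, or no candidate with a positive score.
    """
    ctx = contexts_by_word(sentences)
    category = set(seeds)
    ranked = []  # (word, score) pairs for later human review
    for _ in range(iterations):
        # Collect every context in which a current category word appears.
        seed_contexts = set()
        for word in category:
            seed_contexts |= ctx.get(word, set())
        # Score each non-category word by its overlap with those contexts.
        scores = {}
        for word, word_contexts in ctx.items():
            if word in category:
                continue
            overlap = len(word_contexts & seed_contexts)
            if overlap > 0:
                scores[word] = overlap
        if not scores:
            break
        # Add the best hypotheses to the category before the next round.
        best = sorted(scores, key=lambda w: (-scores[w], w))[:add_per_round]
        for word in best:
            category.add(word)
            ranked.append((word, scores[word]))
    return ranked


if __name__ == "__main__":
    toy_corpus = [
        ["the", "bomb", "exploded"],
        ["the", "grenade", "exploded"],
        ["a", "grenade", "detonated"],
        ["a", "mine", "detonated"],
        ["the", "car", "stopped"],
    ]
    print(bootstrap(toy_corpus, {"bomb"}))
    # prints [('grenade', 1), ('mine', 1)]
```

In this toy run, "mine" shares no context with the original seed "bomb" and is reached only after "grenade" joins the category in the first round, which is the effect of adding the best hypotheses to the seed list dynamically. The returned list still needs human review, as the abstract notes.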
