Customizing a Lexicon to Better Suit a Computational Task

We discuss a method for augmenting and rearranging a structured lexicon in order to make it more suitable for a topic labeling task, by making use of lexical association information from a large text corpus. We first describe an algorithm for converting the hierarchical structure of WordNet [13] into a set of flat categories. We then use lexical cooccurrence statistics in combination with these categories to classify proper names, assign more specific senses to broadly defined terms, and classify new words into existing categories. We also describe how to use these statistics to assign schema-like information to the categories and show how the new categories improve a text-labeling algorithm. In effect, we provide a mechanism for successfully combining a hand-built lexicon with knowledge-free, statistically-derived information.

1 Introduction

Much effort is being applied to the creation of lexicons and the acquisition of semantic and syntactic attributes of the lexical items that comprise them, e.g., [1], [4], [7], [8], [11], [16], [18], [20]. However, a lexicon as given may not suit the requirements of a particular computational task. Because lexicons are expensive to build, rather than create new ones from scratch, it is preferable to adjust existing ones to meet an application's needs. In this paper we describe such an effort: we add associational information to a hierarchically structured lexicon in order to better serve a text labeling task.

An algorithm for partitioning a full-length expository text into a sequence of subtopical discussions is described in [9]. Once the partitioning is done, we need to assign labels¹ indicating what the subtopical discussions are about, for the purposes of information retrieval and hypertext navigation. One way to label texts, when working within a limited domain of discourse, is to start with a pre-defined set of topics and specify the word contexts that indicate the topics of interest (e.g., [10]). Another way, assuming that a large collection of prelabeled texts exists, is to use statistics to automatically infer which lexical items indicate which labels (e.g., [12]). In contrast, we are interested in assigning labels to general, domain-independent text, without benefit of pre-classified texts. In all three cases, a lexicon that specifies which lexical items correspond to which topics is required. The topic labeling method we use is statistical and thus requires a large number of representative lexical items for each category.

The starting point for our lexicon is WordNet [13], which is readily available online and provides a large repository of English lexical items. WordNet² is composed of synsets,

¹ The terms "label" and "topic" are used interchangeably in this paper.
² All work described here pertains to Version 1.3 of WordNet.
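As a concrete illustration of the two operations described above, the following Python fragment flattens the WordNet noun hierarchy into coarse categories by cutting hypernym paths at a fixed depth, and then assigns an unknown word to the category whose member lemmas cooccur with it most often in a tokenized corpus. This is a minimal sketch, not the paper's implementation: the NLTK WordNet interface, the cut depth, and the window size are assumptions introduced here for illustration (the paper works with WordNet 1.3 and its own lexical association statistics).

    from collections import Counter, defaultdict
    from nltk.corpus import wordnet as wn

    CUT_DEPTH = 3  # hypothetical depth at which the hypernym hierarchy is flattened

    def flat_categories(cut_depth=CUT_DEPTH):
        # Map every noun lemma to the ancestor synset(s) found at a fixed depth
        # on its hypernym paths; each such ancestor serves as one flat category.
        categories = defaultdict(set)
        for synset in wn.all_synsets('n'):
            for path in synset.hypernym_paths():      # path runs root -> synset
                if len(path) > cut_depth:
                    label = path[cut_depth].name()    # e.g. 'object.n.01'
                    for lemma in synset.lemma_names():
                        categories[label].add(lemma.lower())
        return categories

    def assign_by_cooccurrence(word, tokens, categories, window=20):
        # Score each flat category by how many of its member lemmas appear
        # within a fixed-size window around occurrences of `word`, then return
        # the best-scoring category (a crude stand-in for corpus-derived
        # lexical association statistics).
        scores = Counter()
        for i, token in enumerate(tokens):
            if token != word:
                continue
            context = set(tokens[max(0, i - window): i + window + 1])
            for label, members in categories.items():
                scores[label] += len(context & members)
        return scores.most_common(1)[0][0] if scores else None

For example, assign_by_cooccurrence('pentium', tokens, flat_categories()) would return the flat category whose members most often appear near the (hypothetical) unknown word 'pentium' in the corpus; the approach described in the paper additionally handles proper names, sense refinement, and schema-like information for each category.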

[1] Philip Resnik et al. WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery. AAAI, 1992.

[2] Hinrich Schütze et al. Word Space. NIPS, 1992.

[3] Marvin Minsky et al. A framework for representing knowledge. In The Psychology of Computer Vision, 1975.

[4] Michael W. Berry et al. Large-Scale Sparse Singular Value Computations. 1992.

[5] David Yarowsky et al. Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. COLING, 1992.

[6] Hiyan Alshawi et al. Processing Dictionary Definitions with Phrasal Pattern Hierarchies. CL, 1987.

[7] Hinrich Schütze et al. Part-of-Speech Induction From Scratch. ACL, 1993.

[8] James Pustejovsky et al. On the Acquisition of Lexical Entries: The Perceptual Origin of Thematic Relations. ACL, 1987.

[9] Edward M. Reingold et al. Graph drawing by force-directed placement. Softw. Pract. Exp., 1991.

[10] Lisa F. Rau et al. SCISOR: extracting information from on-line news. CACM, 1990.

[11] Martha W. Evens et al. Semantically Significant Patterns in Dictionary Definitions. ACL, 1986.

[12] Nicoletta Calzolari et al. Acquisition of Lexical Information from a Large Textual Italian Corpus. COLING, 1990.

[13] Marti A. Hearst. TextTiling: A Quantitative Approach to Discourse. 1993.

[14] George A. Miller et al. Introduction to WordNet: An On-line Lexical Database. 1990.

[15] Marti A. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING, 1992.

[16] David L. Waltz et al. Classifying news stories using memory based reasoning. SIGIR '92, 1992.

[17] T. Landauer et al. Indexing by Latent Semantic Analysis. 1990.

[18] Marvin Minsky et al. A framework for representing knowledge. 1974.