Scaling Context Space

Context is used in many NLP systems as an indicator of a term's syntactic and semantic function. The accuracy of such a system depends on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high-quality contextual information from a fixed corpus. With an effectively limitless quantity of text available, however, extraction rate and representation size also need to be considered. We use thesaurus extraction with a range of context-extraction tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
