Using the Web to Overcome Data Sparseness

This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating that web frequencies correlate with frequencies obtained from a carefully edited, balanced corpus. We also perform a task-based evaluation, showing that web frequencies can reliably predict human plausibility judgments.
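The correlation-based evaluation mentioned above can be illustrated with a minimal sketch. It assumes that counts are compared on a log scale using Pearson's r, and that bigrams with a zero count in either source are skipped; neither choice is stated in the abstract. The function names and the dictionary-based input format are hypothetical, and no search engine is queried here: the counts are expected to have been collected beforehand.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log_count_correlation(web_counts, corpus_counts):
    """Correlate log-transformed web and corpus counts.

    Both arguments map a bigram (any hashable key) to a count.
    Assumption: only bigrams present in both mappings with a
    positive count in both are compared.
    """
    shared = sorted(set(web_counts) & set(corpus_counts))
    pairs = [
        (math.log(web_counts[b]), math.log(corpus_counts[b]))
        for b in shared
        if web_counts[b] > 0 and corpus_counts[b] > 0
    ]
    web_logs = [p[0] for p in pairs]
    corpus_logs = [p[1] for p in pairs]
    return pearson(web_logs, corpus_logs)
```

Called with two dictionaries keyed by bigram, for example `("hungry", "animal")`, `log_count_correlation` returns a value between -1 and 1. A constant scale difference between web and corpus counts becomes a constant shift after the log transform, so it does not affect the coefficient.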
