Using the Web to Obtain Frequencies for Unseen Bigrams

This article shows that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the Web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between Web frequencies and corpus frequencies; (b) a reliable correlation between Web frequencies and plausibility judgments; (c) a reliable correlation between Web frequencies and frequencies recreated using class-based smoothing; and (d) good performance of Web frequencies in a pseudo-disambiguation task.
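The two kinds of evaluation named above, correlation between count sources and pseudo-disambiguation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the 0.5 floor used before taking logarithms, the tie-handling rule, and the toy counts are hypothetical and are not taken from the article, and no real search engine is queried.

```python
import math

def log_counts(counts, floor=0.5):
    # Log-transform raw counts; zero counts are replaced by a small
    # floor value (an assumption of this sketch) so the log is defined.
    return [math.log10(c if c > 0 else floor) for c in counts]

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def pseudo_disambiguation_accuracy(triples, counts):
    # Each triple is (head, attested, confounder). A decision is correct
    # when the attested pairing has a strictly higher count than the
    # confounder pairing; ties count as errors in this sketch.
    correct = 0
    for head, attested, confounder in triples:
        if counts.get((head, attested), 0) > counts.get((head, confounder), 0):
            correct += 1
    return correct / len(triples)

# Made-up counts for illustration only; these are not real Web or corpus data.
toy_web = {("drink", "coffee"): 120000, ("drink", "table"): 300,
           ("read", "book"): 450000, ("read", "cloud"): 900}
toy_corpus = {("drink", "coffee"): 85, ("drink", "table"): 0,
              ("read", "book"): 310, ("read", "cloud"): 1}

pairs = sorted(toy_web)
r = pearson(log_counts([toy_web[p] for p in pairs]),
            log_counts([toy_corpus[p] for p in pairs]))
acc = pseudo_disambiguation_accuracy(
    [("drink", "coffee", "table"), ("read", "book", "cloud")], toy_web)
print(f"correlation of log counts: {r:.3f}")
print(f"pseudo-disambiguation accuracy: {acc:.2f}")
```

In a real setting the toy dictionaries would be replaced by counts retrieved for each bigram, and the correlation would be computed over a much larger sample of bigrams.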
