Paper Information - Using the Web to Obtain Frequencies for Unseen Bigrams

This article shows that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the Web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between Web frequencies and corpus frequencies; (b) a reliable correlation between Web frequencies and plausibility judgments; (c) a reliable correlation between Web frequencies and frequencies recreated using class-based smoothing; (d) a good performance of Web frequencies in a pseudo disambiguation task.
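Two of the evaluations named in the abstract, the correlation between Web and corpus frequencies and the pseudo disambiguation task, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the log10 transform, the tie handling, and the toy counts are hypothetical and are not taken from the article, and no search engine is queried.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_count_correlation(web, corpus):
    """Correlate log10 counts over bigrams with nonzero counts in both sources.

    Assumption: counts are log-transformed before correlating.
    """
    shared = [b for b in web if b in corpus and web[b] > 0 and corpus[b] > 0]
    return pearson(
        [math.log10(web[b]) for b in shared],
        [math.log10(corpus[b]) for b in shared],
    )


def pseudo_disambiguation_accuracy(pairs, counts):
    """Fraction of (attested, confounder) pairs where the attested bigram wins.

    Assumptions: bigrams missing from `counts` are treated as 0,
    and ties are scored as errors.
    """
    correct = sum(
        1
        for attested, confounder in pairs
        if counts.get(attested, 0) > counts.get(confounder, 0)
    )
    return correct / len(pairs)


# Hypothetical toy counts, invented for this sketch only.
web_counts = {
    ("hungry", "animal"): 1000,
    ("hungry", "table"): 10,
    ("drink", "water"): 100000,
    ("drink", "stone"): 100,
}
corpus_counts = {
    ("hungry", "animal"): 10,
    ("drink", "water"): 1000,
    ("drink", "stone"): 1,
}
pairs = [
    (("hungry", "animal"), ("hungry", "table")),
    (("drink", "water"), ("drink", "stone")),
]

r = log_count_correlation(web_counts, corpus_counts)
accuracy = pseudo_disambiguation_accuracy(pairs, web_counts)
```

In this formulation each test pair sets an attested bigram against a confounder that shares its first word, and a frequency source scores well when it assigns the attested bigram the higher count.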

Frank Keller | Mirella Lapata

[1] Judith N. Levi, et al. The syntax and semantics of complex nominals, 1978.

[2] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[3] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[4] Casimir A. Kulikowski, et al. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems, 1990.

[5] Mats Rooth, et al. Structural Ambiguity and Lexical Relations, 1991, ACL.

[6] Dania Egedi, et al. A Freely Available Wide Coverage Morphological Analyzer for English, 1992, COLING.

[7] Naftali Tishby, et al. Distributional Clustering of English Words, 1993, ACL.

[8] P. Resnik. Selection and information: a class-based approach to lexical relationships, 1993.

[9] Geoffrey Leech, et al. CLAWS4: The Tagging of the British National Corpus, 1994, COLING.

[10] Ivan A. Sag, et al. Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar, 1996, CL.

[11] Gregory Grefenstette, et al. Explorations in automatic thesaurus discovery, 1994.

[12] Ralph Grishman, et al. Generalizing Automatically Generated Selectional Patterns, 1994, COLING.

[13] Dekang Lin, et al. PRINCIPAR - An Efficient, Broad-coverage, Principle-based Parser, 1994, COLING.

[14] Geoffrey Sampson. English for the computer, 1995.

[15] A. Sorace, et al. Magnitude Estimation of Linguistic Acceptability, 1996.

[16] Steven P. Abney. Partial parsing via finite-state cascades, 1996, Natural Language Engineering.

[17] Carson T. Schütze. The empirical base of linguistics, 2016.

[18] Ted Briscoe, et al. Automatic Extraction of Subcategorization from Corpora, 1997, ANLP.

[19] Wayne Cowart, et al. Experimental Syntax: Applying Objective Methods to Sentence Judgments, 1997.

[20] Marc Light, et al. Hiding a Semantic Class Hierarchy in a Markov Model, 1998.

[21] Adwait Ratnaparkhi, et al. Statistical Models for Unsupervised Prepositional Phrase Attachment, 1998, ACL.

[22] Carson T. Schütze. The empirical base of linguistics: Grammaticality judgments and linguistic methodology, 1998.

[23] Hang Li, et al. Generalizing Case Frames Using a Thesaurus and the MDL Principle, 1995, CL.

[24] Mats Rooth, et al. Valence Induction with a Head-Lexicalized PCFG, 1998, EMNLP.

[25] Rada Mihalcea, et al. A Method for Word Sense Disambiguation of Unrestricted Text, 1999, ACL.

[26] Lillian Lee, et al. Measures of Distributional Similarity, 1999, ACL.

[27] Mats Rooth, et al. Inducing a Semantically Annotated Lexicon via EM-Based Clustering, 1999, ACL.

[28] Gregory Grefenstette, et al. The World Wide Web as a Resource for Example-Based Machine Translation Tasks, 1999, TC.

[29] Philip Resnik, et al. Mining the Web for Bilingual Text, 1999, ACL.

[30] Frank Keller, et al. Determinants of Adjective-Noun Plausibility, 1999, EACL.

[31] Mats Rooth, et al. Using a Probabilistic Class-Based Lexicon for Lexical Ambiguity Resolution, 2000, COLING.

[32] Rosie Jones, et al. Automatically Building a Corpus for a Minority Language from the Web, 2000, ACL 2000.

[33] Diana McCarthy, et al. Using Semantic Preferences to Identify Verbal Participation in Role Switching Alternations, 2000, ANLP.

[34] Gregory Grefenstette, et al. Estimation of English and non-English Language Use on the WWW, 2000, RIAO.

[35] Eneko Agirre, et al. Exploring Automatic Word Sense Disambiguation with Decision Lists and the Web, 2000, SAIC@COLING.

[36] Philip Resnik, et al. Measuring Verb Similarity, 2000.

[37] Frank Keller, et al. Finding Syntactic Structure in Unparsed Corpora: The Gsearch Corpus Query System, 2001, Comput. Humanit.

[38] Frank Keller, et al. Phonology competes with syntax: experimental evidence for the interaction of word order and accent placement in the realization of Information Structure, 2001, Cognition.

[39] Ronald Rosenfeld, et al. Improving trigram language modeling with the World Wide Web, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[40] Stephen Clark, et al. Class-Based Probability Estimation Using a Semantic Hierarchy, 2001, NAACL.

[41] Maria Lapata, et al. A Corpus-based Account of Regular Polysemy: The Case of Context-sensitive Adjectives, 2001, NAACL.

[42] Frank Keller, et al. Evaluating Smoothing Algorithms against Plausibility Judgements, 2001, ACL.

[43] Ash Asudeh, et al. Constraints on Linguistic Coreference: Structural vs. Pragmatic Factors, 2001.

[44] Michele Banko, et al. Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing, 2001, HLT.

[45] Dekang Lin. LaTaT: Language and Text Analysis Tools, 2001, HLT.

[46] Michele Banko, et al. Scaling to Very Very Large Corpora for Natural Language Disambiguation, 2001, ACL.

[47] Martin Volk, et al. Exploiting the WWW as a corpus to resolve PP attachment ambiguities, 2001.

[48] Maria Lapata. The Disambiguation of Nominalisations, 2002.

[49] Frank Keller, et al. Using the Web to Overcome Data Sparseness, 2002, EMNLP.

[50] Geoffrey Sampson, et al. English for the Computer: The SUSANNE Corpus and Analytic Scheme, 1995, Computational Linguistics.

[51] Maria Lapata, et al. The Disambiguation of Nominalizations, 2002, CL.

[52] James R. Curran, et al. Scaling Context Space, 2002, ACL.

[53] M. Corley, et al. Syntactic priming in English sentence production: Categorical and latency evidence from an Internet-based study, 2002, Psychonomic Bulletin & Review.

[54] Malvina Nissim, et al. Using the Web for Nominal Anaphora Resolution, 2003.

[55] Dekang Lin, et al. Dependency-Based Evaluation of Minipar, 2003.

[56] Ido Dagan, et al. Similarity-Based Models of Word Cooccurrence Probabilities, 1998, Machine Learning.

[57] Frank Keller, et al. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks, 2004, NAACL.

[58] Mike Thelwall, et al. Text characteristics of English language university Web sites, 2005, J. Assoc. Inf. Sci. Technol.