Paper information - How Random is a Corpus? The Library Metaphor

How Random is a Corpus? The Library Metaphor

Abstract: There is a stark contrast between the random sample model underlying the statistical analysis of corpus frequency data and our intuitive knowledge that sentences are more than random bags of words. The 'library metaphor' illustrates how randomness results from the selection of a corpus as the basis for a linguistic study. At the same time it reveals two reasons why corpus data do not fully meet the assumptions of the random sample model. Finally, practicable methods for identifying and quantifying non-randomness are introduced and demonstrated on the example of passive verb forms.

Stefan Evert

[1] Stefan Evert, et al. The Statistics of Word Cooccurrences: Word Pairs and Collocations, 2004.

[2] Sebastian Hoffmann. BNCweb (CQP edition) - the marriage of two corpus tools, 2006.

[3] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[4] Kie Zuraw. Probability in Language Change, 2002.

[5] Kenneth Ward Church. Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p², 2000, COLING.

[6] A. Kilgarriff. Comparing Corpora, 2001.

[7] Slava M. Katz. Distribution of content words and phrases in text and language modelling, 1996, Natural Language Engineering.

[8] Stefan Evert, et al. Using web data for linguistic purposes, 2007.

[9] Douglas Biber, et al. Dimensions of Register Variation: A Cross-Linguistic Comparison, 1995.

[10] Walter L. Smith. Probability and Statistics, 1959, Nature.

[11] H. Kucera, et al. Computational analysis of present-day American English, 1967.

[12] R. Harald Baayen, et al. Probabilistic approaches to morphology, 2003.

[13] Oliver Christ, et al. A Modular and Flexible Architecture for an Integrated Corpus Query System, 1994, ArXiv.

[14] R. Harald Baayen, et al. Word Frequency Distributions, 2001.