Paper Information - Fishing for Exactness

Fishing for Exactness

The automatic identification of dependent word pairs (i.e. dependent bigrams) in a corpus of natural language text has traditionally relied on asymptotic tests of significance. This paper suggests that Fisher's exact test is more appropriate because the data samples typical of this problem are skewed and sparse. Both theoretical and experimental comparisons between Fisher's exact test and a variety of asymptotic tests (the t-test, Pearson's chi-square test, and the likelihood-ratio chi-square test) are presented. These comparisons show that Fisher's exact test is more reliable in identifying dependent word pairs. The usefulness of Fisher's exact test extends to other problems in statistical natural language processing, as skewed and sparse data appear to be the rule in natural language. The experiment presented in this paper was performed using PROC FREQ of the SAS System.
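To make the comparison in the abstract concrete, the following is a minimal sketch of how Fisher's exact test and Pearson's chi-square statistic could be computed for a single bigram from its 2x2 contingency table. It is not the paper's procedure (the paper reports using PROC FREQ of the SAS System); the function names, the choice of a right-sided test, and the example counts are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_right(n11, n12, n21, n22):
    """Right-sided Fisher's exact p-value for a 2x2 table (illustrative sketch).

    n11: count of the bigram (w1 followed by w2)
    n12: w1 followed by any word other than w2
    n21: any word other than w1, followed by w2
    n22: neither w1 first nor w2 second
    """
    row1 = n11 + n12            # occurrences of w1 in first position
    col1 = n11 + n21            # occurrences of w2 in second position
    n = n11 + n12 + n21 + n22   # total number of bigrams
    # With the marginals fixed, n11 follows a hypergeometric distribution.
    # Sum the probabilities of every table whose n11 is at least the observed one.
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(n11, min(row1, col1) + 1)
    )
    return numerator / comb(n, col1)


def pearson_chi_square(n11, n12, n21, n22):
    """Pearson's chi-square statistic for a 2x2 table (assumes nonzero marginals)."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    return n * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)


if __name__ == "__main__":
    # Hypothetical counts for a rare bigram in a 100,000-bigram corpus.
    table = (10, 20, 30, 99_940)
    print("Fisher right-sided p-value:", fisher_exact_right(*table))
    print("Pearson chi-square statistic:", pearson_chi_square(*table))
```

The exact test sums hypergeometric probabilities directly, so it does not depend on the large-sample approximation that the chi-square statistic needs to be referred to its reference distribution. That approximation is the part that becomes questionable when cell counts are small and skewed, as in the hypothetical table above.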

Ted Pedersen

[1] Timothy R. C. Read, et al. Goodness-Of-Fit Statistics for Discrete Multivariate Data, 1988.

[2] J. I. The Design of Experiments, 1936, Nature.

[3] Uri Zernik, et al. Lexical acquisition: Exploiting on-line resources to build a lexicon, 1991.

[4] G. Zipf, et al. The Psycho-Biology of Language, 1936.

[5] Mehmet Kayaalp, et al. Significant Lexical Relationships, 1996.

[6] G. Allport. The Psycho-Biology of Language, 1936.

[7] Ted Dunning, et al. Accurate Methods for the Statistics of Surprise and Coincidence, 1993, CL.

[8] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.