Paper information - Corpus Similarity and Homogeneity via Word Frequency

Corpus Similarity and Homogeneity via Word Frequency

A measure of corpus similarity would be very useful for lexicography. Word frequency lists are cheap and easy to generate, so a measure based on them can be used where a detailed comparison of the two corpora is not viable, for example, to judge how a new corpus relates to already familiar ones. We show that corpus similarity can only be interpreted in the light of corpus homogeneity, and present a measure, based on the chi-square statistic, for measuring both corpus similarity and corpus homogeneity.
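The abstract does not spell out how the chi-square measure is computed, so the following is only a minimal sketch under stated assumptions. It assumes the statistic is summed over the `top_n` most frequent words of the combined corpora, with expected counts proportional to corpus size, and that homogeneity is estimated by comparing two halves of the same corpus. The names `chi_square_distance`, `homogeneity`, and `top_n`, as well as the alternating-document split, are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Assumption: expected counts are the joint count of a word shared out
    in proportion to each corpus's size. Lower values mean more similar.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")
    total = size_a + size_b
    joint = freq_a + freq_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


def homogeneity(documents, top_n=500):
    """Within-corpus score: compare two halves of the same corpus.

    Assumption: documents are assigned to halves alternately; a random
    split repeated several times would be a reasonable alternative.
    """
    half_a = [tok for doc in documents[0::2] for tok in doc]
    half_b = [tok for doc in documents[1::2] for tok in doc]
    return chi_square_distance(half_a, half_b, top_n)
```

Under these assumptions, a similarity score between two corpora would be read against each corpus's own homogeneity score: a between-corpus value that is no larger than the within-corpus values suggests the two corpora are not distinguishable by word frequency alone.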

Adam Kilgarriff | Raphael Salkie
