Corpus Similarity and Homogeneity via Word Frequency

A measure of corpus similarity would be very useful for lexicography. Word frequency lists are cheap and easy to generate, so a measure based on them can be used where a detailed comparison of the two corpora is not viable, for example, to judge how a new corpus relates to already familiar ones. We show that corpus similarity can only be interpreted in the light of corpus homogeneity, and present a measure, based on the chi-square statistic, for measuring both corpus similarity and corpus homogeneity.
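
To make the idea concrete, the following is a minimal sketch of how a chi-square score over word frequency lists might be computed. The text above only names the chi-square statistic, so the details here are assumptions for illustration: the function names `chi_square_distance` and `homogeneity`, the `top_n` cutoff on the most frequent words, expected counts taken in proportion to corpus size, and homogeneity estimated by comparing two random halves of a single corpus.

```python
import random
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora combined.

    Assumption: the expected count of a word in each corpus is its joint
    count scaled by that corpus's share of the total token count.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must be non-empty")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    joint = freq_a + freq_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


def homogeneity(tokens, top_n=500, chunk_size=1000, seed=0):
    """Within-corpus score: compare two random halves of the same corpus.

    Assumption: the corpus is cut into fixed-size chunks, the chunks are
    shuffled, and the two halves are scored with chi_square_distance.
    """
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(chunks) < 2:
        raise ValueError("corpus too small to split into two halves")
    random.Random(seed).shuffle(chunks)
    half = len(chunks) // 2
    first = [tok for chunk in chunks[:half] for tok in chunk]
    second = [tok for chunk in chunks[half:] for tok in chunk]
    return chi_square_distance(first, second, top_n)
```

Under these assumptions, a lower score means the two word lists are more alike. A between-corpus score is only informative when set against the within-corpus scores of each corpus: two heterogeneous corpora can score far apart from each other simply because each is already far from itself.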