Retrieving Collocations From Korean Text

This paper describes a statistical methodology for automatically retrieving collocations from POS-tagged Korean text using interrupted bigrams. The free word order of Korean makes it hard to identify collocations. We devised four statistics, 'frequency', 'randomness', 'condensation', and 'correlation', to account for the more flexible word order properties of Korean collocations. We extracted meaningful bigrams using an evaluation function and extended the bigrams to n-gram collocations by generating equivalence sets, α-covers. We view the modeling problem for n-gram collocations as a problem of clustering cohesive words.

1 Introduction

There have been many theoretical and applied works related to collocations. The rapidly growing availability of corpora has attracted interest in statistical methods for automatically extracting collocations from textual corpora. However, it is not easy to identify the central tendencies of collocation distributions, and the borderlines of the criteria are often fuzzy because the expressions can be of arbitrary length and take a large variety of forms. Getting reliable collocation patterns is particularly difficult in Korean, which allows arguments to scramble so freely.

This paper presents a statistical method using 'interrupted bigrams' for automatically retrieving collocations and idiomatic expressions from Korean text. We suggest several statistics to account for the more flexible word order. If the distribution of a random sample is unknown, we often try to make inferences about its properties described by suitably defined measures. For the properties of an arbitrary collocation distribution, four measure statistics were devised: 'high frequency', 'condensation', 'randomness', and 'correlation'. Given a morpheme, our system begins by retrieving the frequency distributions of all bigrams within a window, and then meaningful bigrams are extracted.
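The first step described above, collecting the frequency distribution of interrupted bigrams around a given morpheme, can be illustrated with a minimal Python sketch. It is not the system's implementation: the function name `interrupted_bigram_counts`, the token format (`morpheme/TAG` strings), the tag labels, and the window size are all assumptions made for illustration, and the sketch only gathers raw positional counts rather than computing the four statistics.

```python
from collections import Counter, defaultdict


def interrupted_bigram_counts(sentences, target, window=3):
    """Count co-occurrences of `target` with its neighbors, by signed offset.

    `sentences` is an iterable of token lists (hypothetical format:
    "morpheme/TAG" strings). The result maps each neighbor to a Counter
    of offsets, where a negative offset means the neighbor precedes
    the target. The window size is an assumed parameter.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target position itself
                    counts[tokens[j]][j - i] += 1
    return counts


# Toy, hand-made input (not corpus data); tags are illustrative only.
sentences = [
    ["마음/N", "을/J", "먹/V", "다/E"],
    ["마음/N", "을/J", "굳게/A", "먹/V", "었/E", "다/E"],
]
counts = interrupted_bigram_counts(sentences, "먹/V", window=3)
for neighbor, by_offset in counts.items():
    print(neighbor, dict(by_offset))
```

In this toy input the pair (마음/N, 먹/V) occurs once at offset -2 and once at offset -3, which is the kind of positional variation that an adjacent-bigram count would miss and that interrupted bigrams are meant to capture.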
We produce α-covers to extend the meaningful bigrams into n-gram collocations.¹ According to the definitions of Kjellmer and Cowie, a fossilized phrase is a sequence in which the occurrence of one word almost predicts the rest of the phrase, whereas in a semi-fossilized phrase one word predicts a very limited number of words (Kjellmer, 1995) (Cowie, 1981). However, in both fossilized and semi-fossilized types there is a high degree of cohesion among the members of the phrases (Kjellmer, 1995). We consider the cohesions as α-covers that are obtained by applying a fuzzy compatibility relation, which satisfies symmetry and reflexivity, to meaningful bigrams. Namely, n-gram collocations could be interpreted as equivalent sets of the meaningful bigrams through partitioning. Here, α-covers mean the clustered sets of the meaningful bigrams.

2 Related Works

In determining properties of collocations, most corpus-based approaches accepted that the words of a collocation have a particular statistical distribution (Cruse, 1986). Although previous approaches have shown good results in retrieving collocations and many properties have been identified, they depend heavily on the frequency factor. (Choueka et al., 1983) proposed an algorithm for retrieving only uninterrupted collocations,²

¹ Bigrams and n-grams can be either adjacent morphemes or morphemes separated by an arbitrary number of other words.
² In the case of an interrupted collocation, words can be separated by an arbitrary number of words, whereas