Automatic Extraction of Multiword Units for Estonian: Phrasal Verbs

To analyse and synthesise real sentences of a language, it is not sufficient to know the words and syntax rules of that language. One also has to be aware of its common expressions, which range from complicated idioms to simple frequent phrases. At present, we do not know much about frequent Estonian expressions. A few publications deal with such phenomena (EKSS, Hasselblatt 1990, Õim 1993, Õim 1998), but they are aimed at a human reader. Based on these studies, a database of multiword units has been compiled (http://www.cl.ut.ee/ee/ressursid/pysiyhendid.html), but the usage and frequency of these units in real-life texts are still unexplored. Fortunately, language-independent computational tools have been developed to identify and extract multiword units from electronic text corpora (Dias et al. 2000). Their ability to deal with all kinds of languages, and in particular Estonian, is a strong motivation to find the frequency of various expressions in real-life texts, and to identify expressions that are missing from the database and could enrich it. The procedure is simple: run a statistical program, find expressions among the multiword unit candidates, compare the results with the existing database, and add the new information. However, problems are likely to arise: the program may find expressions that make little sense to a linguist, and may fail to find those that a linguist would identify from the text by hand. To get the most out of a statistical tool, we must take into account the linguistic properties of the text and of the expressions we are interested in, as well as the requirements of the statistical tool. Below we present a case study that demonstrates a successful way of combining linguistic and statistical processing: extracting Estonian phrasal verbs from a text corpus. We evaluate the results by comparing them to a database of phrasal verbs, built manually from existing dictionaries beforehand.
We also evaluate the database itself.
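
The procedure outlined above (extract candidates, compare them with the database, note what is new and what is missed) can be illustrated with a minimal sketch. It is not the tool of Dias et al. (2000): pointwise mutual information over adjacent word pairs is used here only as a simple stand-in association measure, and the function names, the threshold `min_count`, the toy lemmatised token list and the two-entry database are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_candidates(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI is a stand-in measure chosen for brevity, not the measure used
    by the extraction tool cited in the text.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable statistics
        scores[(w1, w2)] = math.log2(count * n / (unigrams[w1] * unigrams[w2]))
    return scores


def compare_with_database(candidates, database):
    """Split results into confirmed, new and missed expressions."""
    found = set(candidates)
    return {
        "confirmed": found & database,  # extracted and already in the database
        "new": found - database,        # extracted, to be checked by a linguist
        "missed": database - found,     # in the database, not extracted
    }


# Toy data (hypothetical): a lemmatised token sequence and a tiny database.
tokens = "ta aru saama ja mina aru saama ja mina alla kirjutama".split()
database = {("aru", "saama"), ("alla", "kirjutama")}

result = compare_with_database(bigram_candidates(tokens), database)
for label, pairs in result.items():
    print(label, sorted(pairs))
```

On this toy input the sketch confirms "aru saama", misses "alla kirjutama" because it occurs only once, and proposes frequent but linguistically uninteresting pairs such as "ja mina" as new candidates. These are the two kinds of error anticipated above, and they are the reason the statistical step has to be combined with linguistic processing.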