Automatic Extraction of Verb Phrases from Annotated Corpora: A Linguistic Evaluation for Estonian

To analyze and synthesize real sentences of a language, one has to be aware of its common expressions, which range from complicated idioms to simple frequent phrases. A special case of such common expressions is verb phrases, i.e., phrasal verbs like to pay off and idiomatic expressions like to laugh one to pieces. In this paper, we present the SENTA system, which proposes an innovative architecture that avoids the definition of global association measure thresholds and defines a new association measure that does not overestimate the degree of cohesion of word sequences containing frequent fragments. Finally, we present a case study that demonstrates a successful way of combining linguistic and statistical processing to extract Estonian phrasal verbs from a text corpus.