A Hybrid System for MorphoSyntactic Disambiguation in Bulgarian
The MorphoSyntactic Disambiguation Problem (MSDP), like the Part-Of-Speech (POS) disambiguation problem, has attracted researchers' attention in NLP since it became obvious that certain partial analyses are useful in practice for tasks such as document indexing and reducing ambiguity in subsequent parsing stages. Interest in the problem is also sustained by the hope that it can be solved with a high degree of certainty without involving deep syntactic analysis. In languages with rich morphology, like Bulgarian, the tagset is likely to grow large (for example, one of the Spanish tagsets [...,] has as many as 475 tags (Garside, Leech and McEnery 1997)). Hence defining an adequate tagging scheme becomes an important question. The main problem with having so many tags is the well-known problem of corpus sparseness, i.e. out of the full set of linguistic descriptions only a few are frequent in the corpus. Representativeness with respect to all grammatical features therefore requires a very large corpus. This phenomenon motivated us to choose compositional tags instead of atomic ones. Because each word in our corpus is associated with a bundle of grammatical features, a smaller amount of text exhibits more dependencies between these features.
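The sparseness argument for compositional tags can be illustrated with a minimal sketch. The toy corpus, the transliterated word forms, and the tag format `pos=N|gender=f|number=sg` are all assumptions made for illustration; they are not the tagset or data used by the authors. The sketch only contrasts how often each atomic tag is observed with how often each individual feature is observed once the same tags are decomposed.

```python
from collections import Counter

# Hypothetical toy corpus of (word, compositional tag) pairs.
# The "feature=value|feature=value" tag format is an illustrative
# assumption, not the tagging scheme described in the paper.
TAGGED = [
    ("kniga", "pos=N|gender=f|number=sg"),
    ("knigi", "pos=N|gender=f|number=pl"),
    ("stol", "pos=N|gender=m|number=sg"),
    ("nova", "pos=A|gender=f|number=sg"),
    ("novi", "pos=A|number=pl"),
]


def split_tag(tag):
    """Split a compositional tag into its (feature, value) pairs."""
    return [tuple(part.split("=", 1)) for part in tag.split("|")]


def atomic_counts(tagged):
    """Count each whole tag as a single, indivisible symbol."""
    return Counter(tag for _, tag in tagged)


def feature_counts(tagged):
    """Count every (feature, value) pair contained in the tags."""
    return Counter(pair for _, tag in tagged for pair in split_tag(tag))


atomic = atomic_counts(TAGGED)
features = feature_counts(TAGGED)

# Each of the five atomic tags is seen exactly once.
print(max(atomic.values()))         # 1
# The same five tokens give three observations of number=sg.
print(features[("number", "sg")])   # 3
```

In this toy setting every atomic tag is observed once, while individual features such as `number=sg` or `pos=N` are each observed three times in the same five tokens, which is the sense in which less text can provide more evidence about feature dependencies.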