Tiered Tagging Revisited

In this paper we describe a new baseline tagset induction algorithm which, unlike the one described in previous work, is fully automatic and produces tagsets that perform better than before. The algorithm is an information-lossless transformation of the MULTEXT-EAST compliant lexical tags into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. Starting from the baseline tagsets, a corpus linguist who is an expert in the language in question may further reduce the tagsets by taking into account the distributional properties of the language. Since any further reduction of the baseline tagsets entails a loss of information, adequate recovery rules should be designed to ensure that the final tagging is expressed in terms of the lexicon encoding.
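
The lossless property stated above can be illustrated with a minimal sketch. It is not the induction algorithm of this paper. It only checks, for a given candidate reduction, whether every pair (word form, reduced tag) still identifies exactly one lexical tag. The toy lexicon, the placeholder word forms, the MSD-like tag strings, and the function names `reduce_tag`, `build_recovery_table` and `is_lossless` are all assumptions introduced for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy lexicon: word form -> set of full lexical tags.
# The entries are illustrative and not taken from any real lexicon.
LEXICON: Dict[str, Set[str]] = {
    "w_a": {"Ncfsry"},
    "w_b": {"Ncfp-n", "Ncfson"},
    "w_c": {"Ncms-n", "Vmip1s", "Vmip3p"},
    "w_d": {"Ncfs-n"},
}


def reduce_tag(tag: str, keep: Tuple[int, ...]) -> str:
    """Project a positional tag onto the attribute positions in `keep`."""
    return "".join(tag[i] for i in keep if i < len(tag))


def build_recovery_table(
    lexicon: Dict[str, Set[str]], keep: Tuple[int, ...]
) -> Dict[Tuple[str, str], str]:
    """Map (word, reduced tag) -> full tag; fail if the mapping is ambiguous."""
    table: Dict[Tuple[str, str], str] = {}
    for word, tags in lexicon.items():
        for tag in sorted(tags):
            key = (word, reduce_tag(tag, keep))
            if key in table:
                # Two lexical tags of the same word collapse to one reduced tag:
                # the full tag could no longer be recovered deterministically.
                raise ValueError(f"ambiguous recovery for {key}")
            table[key] = tag
    return table


def is_lossless(lexicon: Dict[str, Set[str]], keep: Tuple[int, ...]) -> bool:
    """True if the reduction defined by `keep` allows deterministic recovery."""
    try:
        build_recovery_table(lexicon, keep)
    except ValueError:
        return False
    return True


def reduced_tagset(lexicon: Dict[str, Set[str]], keep: Tuple[int, ...]) -> Set[str]:
    """Collect the distinct reduced tags produced over the whole lexicon."""
    return {reduce_tag(t, keep) for tags in lexicon.values() for t in tags}
```

In this toy setting, keeping only position 0 is not lossless, because `w_b` has two tags that both reduce to the same symbol. Keeping positions 0, 3 and 4 is lossless and still merges `Ncms-n` and `Ncfs-n` into one reduced tag, since the dropped attribute can be read back from the lexicon entry of the word.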