Improving Morphosyntactic Tagging of Slovene Language through Meta-tagging

Part-of-speech (PoS) tagging, or more precisely morphosyntactic tagging, is the process of assigning morphosyntactic categories to the words in a text, and is an important pre-processing step for most human language technology applications. PoS tagging of Slovene texts is challenging because the tagset contains over one thousand tags (as opposed to English, where it typically contains around sixty), and state-of-the-art tagging accuracy is still below the desired level. This paper describes an experiment aimed at improving tagging accuracy for Slovene by combining the outputs of two taggers: a proprietary rule-based tagger developed by the Amebis HLT company, and TnT, a trigram HMM tagger trained on a hand-annotated corpus of Slovene. The two taggers have comparable accuracy, but in many of the cases where their predictions differ, one of the two does assign the correct tag. We investigate training a classifier on top of the outputs of both taggers that predicts which of the two is correct. We experiment with different classification algorithms and different feature sets for training, and show that some combinations yield a meta-tagger with significantly higher accuracy than either tagger in isolation.
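The meta-tagging idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the taggers are simply called A and B, the tags are invented examples in an MSD-like style, the feature set (the first letter of each proposed tag) is hypothetical, and a trivial majority-vote lookup table stands in for whatever classification algorithm is actually trained. It is not the feature set or classifier used in the experiments.

```python
from collections import Counter, defaultdict


def features(tag_a, tag_b):
    # Hypothetical feature set: the first letter (coarse category) of each proposed tag.
    return (tag_a[0], tag_b[0])


def train_meta(examples):
    """Learn, per feature value, which tagger is more often right when they disagree.

    examples: iterable of (tag_a, tag_b, gold_tag) triples.
    """
    votes = defaultdict(Counter)
    for tag_a, tag_b, gold in examples:
        if tag_a == tag_b:
            continue  # agreement: nothing to choose between
        if tag_a != gold and tag_b != gold:
            continue  # both wrong: no training signal for "which one is correct"
        votes[features(tag_a, tag_b)]["A" if tag_a == gold else "B"] += 1
    # Stand-in classifier: majority vote per feature value.
    return {feat: counts.most_common(1)[0][0] for feat, counts in votes.items()}


def meta_tag(model, tag_a, tag_b, default="A"):
    """Return the tag of the tagger the model trusts for this disagreement."""
    if tag_a == tag_b:
        return tag_a
    choice = model.get(features(tag_a, tag_b), default)
    return tag_a if choice == "A" else tag_b


def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Invented toy data: (tagger A output, tagger B output, gold tag).
train = [
    ("Ncmsn", "Ncmsa", "Ncmsn"),     # A correct
    ("Ncfsn", "Ncfsa", "Ncfsn"),     # A correct
    ("Agpmsn", "Rgp", "Rgp"),        # B correct
    ("Agpnsn", "Rgp", "Rgp"),        # B correct
    ("Vmip3s", "Vmip3s", "Vmip3s"),  # taggers agree
]
test = [
    ("Ncmpn", "Ncmpa", "Ncmpn"),
    ("Agpfsn", "Rgp", "Rgp"),
    ("Sl", "Sl", "Sl"),
    ("Vmip3s", "Vmip3s", "Vmip3s"),
]

model = train_meta(train)
gold = [g for _, _, g in test]
acc_a = accuracy([a for a, _, _ in test], gold)
acc_b = accuracy([b for _, b, _ in test], gold)
acc_meta = accuracy([meta_tag(model, a, b) for a, b, _ in test], gold)
print(acc_a, acc_b, acc_meta)
```

On this toy data the meta-tagger only has to decide in the cases where the two taggers disagree, which is where the potential gain over either tagger lies. A realistic setup would use richer features and a held-out evaluation.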
