Paper Information - Slovak Web Discussion Corpus

Slovak Web Discussion Corpus

This contribution aims to provide a representative sample of colloquial Slovak in an organized corpus. The corpus makes it possible to study spontaneous, interactive communication, which often contains incorrect or unusual words. It comprises a complete set of web discussions on various topics, collected from a single site. Each discussion is labeled with its topic and speaker and is assigned to a specific section. The corpus includes an index that supports searching with regular expressions. The text of the discussions is processed with our tools for word tokenization, sentence boundary detection, and morphological analysis. Token annotations include a corrected word form proposed by a statistical correction system.
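As a rough illustration of the annotation layers described above, the following Python sketch models a discussion post with its section, topic, and speaker, and tokens that carry both the written form and a corrected form. The class names, field names, tag values, and sample data are all hypothetical and are not taken from the corpus; the `search` helper is a plain linear scan with the standard `re` module, standing in for the corpus index rather than reproducing it.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """One annotated token (hypothetical layout, not the corpus format)."""
    form: str        # word as written in the discussion
    corrected: str   # form proposed by a spelling-correction step
    tag: str         # morphological tag (placeholder values here)


@dataclass
class Post:
    """One discussion contribution with its metadata (assumed fields)."""
    section: str
    topic: str
    speaker: str
    tokens: list


def search(posts, pattern, field="form"):
    """Yield (post, token) pairs whose chosen field matches the regex."""
    rx = re.compile(pattern)
    for post in posts:
        for tok in post.tokens:
            if rx.search(getattr(tok, field)):
                yield post, tok


# Invented sample data: "dakujem" is written without diacritics,
# and the corrected layer holds the standard spelling.
posts = [
    Post("sport", "hokej", "user1",
         [Token("dakujem", "ďakujem", "V"), Token("pekne", "pekne", "D")]),
    Post("auto", "pneumatiky", "user2",
         [Token("neviem", "neviem", "V")]),
]

# Search the corrected layer, then report the form as originally written.
for post, tok in search(posts, r"ujem$", field="corrected"):
    print(post.section, post.speaker, tok.form, tok.corrected)
```

Keeping the written and corrected forms side by side lets a query target either layer, which is one plausible way to find nonstandard spellings through their standard counterparts.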

Jozef Juhár | Daniel Hládek | Ján Staš
