Paper information - Slavonic Corpus for Stylometry Research

Slavonic Corpus for Stylometry Research

Stylometry techniques such as authorship recognition, machine translation detection, and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.

Jan Svec | Jan Rygl
