Slavonic Corpus for Stylometry Research

Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and to gradually extend the existing ones.
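The abstract does not specify the heuristics that ACB uses, so the following is only a minimal sketch of what a heuristic meta-information layer could look like, not the tool's actual implementation. It assumes that the author is given in a `<meta name="author">` tag, that the heading is the first `<h1>` element, and that the publication time appears either in an `article:published_time` meta tag or as a Czech-style `d. m. yyyy` date in the page text. The names `MetaExtractor`, `extract_meta` and `DATE_RE` are hypothetical, and document border detection is not covered.

```python
import re
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects candidate meta-information from one HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.meta = {"author": None, "published": None, "heading": None}
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumption: pages expose author and date through common meta tags.
            name = (attrs.get("name") or attrs.get("property") or "").lower()
            content = attrs.get("content")
            if name == "author" and content:
                self.meta["author"] = content.strip()
            elif name == "article:published_time" and content:
                self.meta["published"] = content.strip()
        elif tag == "h1" and self.meta["heading"] is None:
            # Assumption: the first <h1> is the document heading.
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            # Collapse runs of whitespace inside the heading.
            self.meta["heading"] = " ".join("".join(self._h1_parts).split())


# Fallback heuristic: a date written as day.month.year, e.g. "3. 5. 2013".
DATE_RE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")


def extract_meta(html):
    """Return a dict with author, publication date and heading (None if not found)."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    if meta["published"] is None:
        match = DATE_RE.search(html)
        if match:
            day, month, year = match.groups()
            meta["published"] = f"{year}-{int(month):02d}-{int(day):02d}"
    return meta


page = (
    '<html><head><meta name="author" content="Jana Nováková"></head>'
    "<body><h1> Ukázkový  článek </h1><p>Publikováno 3. 5. 2013</p></body></html>"
)
print(extract_meta(page))
```

Any field that none of the heuristics can fill stays `None`, so a corpus builder could discard or flag documents that lack the meta-information a given stylometry experiment needs.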
