Stylometry in a Bilingual Setup

Stylometry based on the most frequent words does not allow direct comparison of original texts and their translations, i.e. comparison across languages. For instance, in a bilingual Czech-German collection of parallel texts (originals and translations in both directions, along with Czech and German translations from other languages), authors would not cluster across languages, because the word frequency lists of any two Czech texts are bound to be more similar to each other than to that of a German text, and vice versa. We have tried to devise an interlingua that removes the language-specific features while possibly retaining the language-independent features of the individual authorial signal, if they exist. We tagged, lemmatized, and parsed each language counterpart with the corresponding UDPipe language model, which provides linguistic markup that is cross-lingual to a significant extent. We stripped the output of language-dependent items, but that alone did not help much. As a next step, we transformed the lemmas of both language counterparts into shared pseudolemmas based on a very crude Czech-German glossary, with a 95.6% success rate. We show that, for stylometric methods based on the most frequent words, we can do without translations.
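The pseudolemma step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy glossary, the pseudolemma IDs (`P1`, `P2`, ...), and the function names are hypothetical, and treating the success rate as the share of tokens that receive a pseudolemma is only one possible reading of that figure. The input is assumed to be lemma lists already produced by a lemmatizer.

```python
from collections import Counter

# Hypothetical toy glossary: each (language, lemma) pair maps to a shared
# pseudolemma ID. A real glossary would be far larger.
GLOSSARY = {
    ("cs", "být"): "P1",
    ("de", "sein"): "P1",
    ("cs", "a"): "P2",
    ("de", "und"): "P2",
    ("cs", "dům"): "P3",
    ("de", "Haus"): "P3",
}


def to_pseudolemmas(lemmas, lang, glossary=GLOSSARY):
    """Replace each lemma with its pseudolemma; unmapped lemmas become None."""
    return [glossary.get((lang, lemma)) for lemma in lemmas]


def coverage(pseudolemmas):
    """Share of tokens that received a pseudolemma (assumed success measure)."""
    if not pseudolemmas:
        return 0.0
    mapped = sum(1 for p in pseudolemmas if p is not None)
    return mapped / len(pseudolemmas)


def relative_frequencies(pseudolemmas):
    """Relative frequency of each pseudolemma among the mapped tokens."""
    counts = Counter(p for p in pseudolemmas if p is not None)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


# Two parallel toy "texts", already lemmatized.
cs = to_pseudolemmas(["dům", "být", "a", "zahrada"], "cs")
de = to_pseudolemmas(["Haus", "sein", "und", "Garten"], "de")

print(cs, coverage(cs))
print(relative_frequencies(cs) == relative_frequencies(de))
```

Once both language versions are expressed in the same pseudolemma vocabulary, their frequency profiles become directly comparable, which is the precondition for applying a most-frequent-words method across languages.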
