How Good Is Good Enough? Establishing Quality Thresholds for the Automatic Text Analysis of Retro-Digitized Comics

Stylometry based on simple statistical text analysis has proven to be a powerful tool for text classification, for example in authorship attribution. When analyzing retro-digitized comics, manga, and graphic novels, researchers face the problem that automated text recognition (ATR) still produces comparatively high error rates, while manual transcription remains highly time-consuming. In this paper, we present an approach and measures that indicate whether stylometry based on unsupervised ATR will produce reliable results for a given dataset of comics images.
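To make the notion of an ATR "error rate" concrete, the following minimal sketch computes a character error rate (CER) as the Levenshtein edit distance between a manual transcription and the ATR output, divided by the length of the transcription. This is a common general-purpose measure and is shown only as an illustration; it is an assumption here, not necessarily one of the measures proposed in the paper, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER of an ATR output (hypothesis) against a manual transcription (reference)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A value of 0.0 means the ATR output matches the transcription exactly. Values can exceed 1.0 when the output contains many spurious characters.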
