Exploring Learning Approaches for Ancient Greek Character Recognition with Citizen Science Data

The central dogma of handwritten character recognition remains inextricably linked to optical character recognition methods for print media. Beyond their reliance on proprietary data and the lack of open-access software, the applicability of these optical character recognition methods to handwritten characters from low-quality (e.g., damaged) documents remains unknown. In this paper, we compare the performance of state-of-the-art optical character recognition tools designed for print with that of learning models built with state-of-the-art machine learning toolkits and trained on handwritten inputs. Using Tesseract OCR as a baseline, we build, optimize, and evaluate three types of convolutional neural networks trained on the AL-ALL and AL-PUB datasets, collections of images of handwritten ancient Greek characters that were labeled by volunteers through the Ancient Lives online citizen science project. We find our best-performing machine learning model to be 92.57% accurate compared to Tesseract OCR’s 11.15%. Following our analysis, we present a brief examination of our models’ shortcomings, introduce the publicly available AL-PUB dataset, and describe Theia, a web-based tool that makes our machine learning models available for public use. We conclude by discussing the promise of our findings for advancing research at the intersection of machine learning, manuscript transcription, and the digital humanities.
