A First Step Towards NLP from Digitized Manuscripts: Virtual Restoration

Digitization of the documentary heritage conserved in libraries and archives is a common practice, intended to ensure the preservation and accessibility of this large part of humanity's cultural and historical patrimony. For the most precious, fragile, and hard-to-decipher manuscripts, specialized yet portable digitization equipment, such as high-resolution multispectral/hyperspectral cameras, is now available. Digitization has made possible the increasingly extensive use of digital image processing techniques to perform a number of virtual restoration tasks, which constitute a first, often necessary step prior to automatic analysis of the written content, with the ultimate goal of performing automatic transcription and/or natural language processing tasks. Here we report our experience in this field, taking as a case study the removal of one of the most frequent and impairing degradations affecting ancient manuscripts, namely bleed-through distortion. In this case, virtual restoration also has the immediate benefit of facilitating the work of philologists and paleographers interested in examining and transcribing the manuscript in the traditional way.
