Exploring Character Shapes for Unsupervised Reconstruction of Strip-Shredded Text Documents

Digital reconstruction of mechanically shredded documents has received increasing attention in recent years, mainly for historical and forensic needs. Computational methods for this problem are highly desirable in order to reduce time-consuming human effort and to preserve document integrity. The reconstruction of strip-shredded documents is accomplished by horizontally splicing pieces so that the resulting sequence (the solution) is as similar as possible to the original document. In this context, a central issue is quantifying how well the pieces (strips) fit together, which generally involves defining a function that maps a pair of strips to a real value indicating the quality of the fit. This problem is especially challenging for text documents, such as business letters or legal documents, since they carry little color information. The system proposed here addresses this issue by exploring character shapes as visual features for computing compatibility. Experiments conducted with real mechanically shredded documents showed that our approach was more accurate than other popular techniques in the literature on documents with (almost) only textual content.
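
To make the notion of a pairwise compatibility function concrete, the following is a minimal sketch, not the character-shape method proposed in the paper. It assumes each strip is a binarized image stored as rows of 0/1 pixels, scores a pair of strips by the fraction of rows whose touching border pixels disagree (a simple stand-in for a shape-based measure), and orders the strips with a greedy nearest-neighbor chain. The names `edge_cost`, `cost_matrix`, and `greedy_order` are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence

# Assumption: a strip is a list of pixel rows, 1 = ink and 0 = paper.
Strip = Sequence[Sequence[int]]


def edge_cost(left: Strip, right: Strip) -> float:
    """Fraction of rows where the right border of `left` differs from
    the left border of `right` (0.0 means the borders agree everywhere)."""
    mismatches = sum(1 for row_l, row_r in zip(left, right)
                     if row_l[-1] != row_r[0])
    return mismatches / len(left)


def cost_matrix(strips: Sequence[Strip]) -> List[List[float]]:
    """costs[i][j] is the cost of placing strip j immediately right of strip i."""
    n = len(strips)
    return [[edge_cost(strips[i], strips[j]) if i != j else float("inf")
             for j in range(n)]
            for i in range(n)]


def greedy_order(costs: List[List[float]]) -> List[int]:
    """Try every strip as the leftmost one, extend the chain with the
    cheapest unused neighbor, and keep the chain with the lowest total cost."""
    n = len(costs)
    best_order: List[int] = []
    best_total = float("inf")
    for start in range(n):
        order, total = [start], 0.0
        remaining = [j for j in range(n) if j != start]
        while remaining:
            nxt = min(remaining, key=lambda j: costs[order[-1]][j])
            total += costs[order[-1]][nxt]
            order.append(nxt)
            remaining.remove(nxt)
        if total < best_total:
            best_order, best_total = order, total
    return best_order
```

Calling `greedy_order(cost_matrix(strips))` on a shuffled list of strips returns the indices in the estimated left-to-right order. The cost matrix is asymmetric, since placing strip j to the right of strip i is not the same as the reverse, and a greedy chain is only a heuristic; an exact or metaheuristic solver could be substituted without changing the compatibility function.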
