A Computational Forensic Approach to the Analysis of Questioned Document Fragments

Fragments of documents are common subjects in the forensic analysis of questioned documents. Forensic analysis of torn documents is more challenging owing to their sparse content; for example, a document fragment might contain only part of a word. The difficulty increases when a large number of such fragments must be analyzed, and a forensic expert might overlook evidence in such a large pool of data. This dissertation aims to address this problem by studying scientific methodologies that narrow the search space for a forensic expert.

Document fragments can be sorted automatically according to criteria set by the forensic expert. This requires the following tasks: (i) text/graphics segmentation; (ii) segmentation of text by type (printed/handwritten); (iii) script identification of the text; (iv) identification of the writer; and (v) identification of the font of printed text. Methodologies based on various image processing and pattern recognition techniques are proposed for these tasks, and rigorous experiments have been carried out to evaluate them on real-life torn document fragments. Feature encoding techniques have been chosen so that the discriminative properties of the different objects of interest are well represented, which makes the classification task easier. For example, for writer identification we implemented a feature encoding scheme that reveals variations in character shape structure between different writers.

The thesis consists of 10 chapters. A brief overview of each chapter is as follows:

• Chapter 1 introduces the topic and the challenges associated with it, and provides the motivation behind the research addressed in this thesis. It first outlines the problem and then explains how the research aims to solve the problem of sorting torn document fragments. It then lists the contributions of the thesis, followed by a brief description of all chapters in the “Thesis Outline” section.
• Chapter 2 presents the background theory that forms the basis of the solutions proposed in this thesis. Our analysis revealed that similar torn document fragments can be sorted by exploiting characteristics of their content, such as the script and font of printed text and the writer of handwritten text. Background knowledge on these topics is discussed in this chapter. Because the Support Vector Machine (SVM) classifier is used extensively in all experiments, a theoretical discussion of SVMs is also provided.
• Chapter 3 presents the existing state-of-the-art methodologies for the subproblems that must be addressed to accomplish our objective: (a) text/graphics segmentation; (b) script identification; (c) writer identification; (d) font identification.
• Chapter 4 provides a conclusion and directions for future research.
• Chapter 5 is based on the article “Document-Zone Classification in Torn Documents” and is devoted to the problem of text/graphics segmentation and text type discrimination, i.e., the identification of printed and handwritten text in torn document fragments.
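The idea of sorting fragments according to criteria set by the forensic expert can be illustrated with a minimal sketch. It assumes that each fragment has already been labeled by the classification tasks listed above; the `Fragment` fields, the `sort_fragments` function, and the sample labels are hypothetical names introduced for illustration and are not taken from the thesis.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    """Predicted labels for one torn fragment (hypothetical field names)."""
    fragment_id: str
    content: str                      # "text" or "graphics"
    text_type: Optional[str] = None   # "printed" or "handwritten"
    script: Optional[str] = None      # e.g. "Bangla", "Latin"
    writer: Optional[str] = None      # meaningful for handwritten text only
    font: Optional[str] = None        # meaningful for printed text only


def sort_fragments(
    fragments: Sequence[Fragment], criteria: Sequence[str]
) -> Dict[Tuple[Optional[str], ...], List[str]]:
    """Group fragment ids by the label fields the expert selects."""
    groups: Dict[Tuple[Optional[str], ...], List[str]] = defaultdict(list)
    for frag in fragments:
        # The group key is the tuple of labels for the chosen criteria.
        key = tuple(getattr(frag, name) for name in criteria)
        groups[key].append(frag.fragment_id)
    return dict(groups)


# Illustrative labels only; in practice they would come from the classifiers.
fragments = [
    Fragment("f1", "text", "handwritten", "Bangla", writer="W07"),
    Fragment("f2", "text", "printed", "Latin", font="Times"),
    Fragment("f3", "text", "handwritten", "Bangla", writer="W12"),
    Fragment("f4", "graphics"),
]

by_type_and_script = sort_fragments(fragments, ("text_type", "script"))
by_writer = sort_fragments(fragments, ("text_type", "script", "writer"))
```

Grouping by `("text_type", "script")` places `f1` and `f3` in the same bin, while adding `"writer"` to the criteria separates them. This is the sense in which each additional criterion narrows the set of fragments an expert has to inspect.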
