A Robust Methodology for Creating Large Image Datasets Using a Universal Format

In this paper, we propose an automated methodology for creating large image datasets by cropping images from an elementary and universal format. The format consists of rectangles that can vary in size, number, and position, and it enables us to extract thousands of images in a matter of minutes with little manual effort. The primary motivation for developing such a technique is that large datasets are required to train very deep and large networks, which are the foundation of Artificial Intelligence (AI) and Computer Vision. Using this methodology, large datasets can be gathered conveniently and without any special equipment. The format can also be used to collect diverse datasets that benefit engineers and researchers across domains, and it is suitable for real-time applications as well. In the present work, we use this methodology to collect a dataset of handwritten images in the Punjabi language. The technique uses contours and edge detection to locate specific shapes and match their dimensions and positions against the described parameters. Using this technique, we were able to collect handwritten image datasets from three different forms of the Punjabi language with high accuracy.
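
As a rough illustration of the kind of pipeline described above, the following is a minimal sketch using OpenCV in Python: it applies Canny edge detection, finds contours, keeps those that approximate rectangles of a plausible size, and crops the corresponding regions to separate image files. The function name, thresholds, and size limits are illustrative assumptions, not the paper's actual parameters.

```python
# Minimal sketch (illustrative only): locate rectangle-like regions on a scanned
# form via edge detection + contours, then crop each region to its own image.
# Thresholds and size limits below are assumptions, not the paper's parameters.
import os
import cv2


def crop_rectangles(image_path, out_dir, min_size=(40, 40), max_size=(1000, 1000)):
    """Crop rectangle-shaped contours from a scanned form image."""
    os.makedirs(out_dir, exist_ok=True)

    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Edge map (Canny), then the external contours of the detected shapes.
    edges = cv2.Canny(blurred, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    count = 0
    for contour in contours:
        # Approximate the contour; four vertices suggest a rectangle.
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        if len(approx) != 4:
            continue

        # Keep only rectangles whose dimensions match the expected range.
        x, y, w, h = cv2.boundingRect(approx)
        if not (min_size[0] <= w <= max_size[0] and min_size[1] <= h <= max_size[1]):
            continue

        # Crop the region inside the rectangle and save it as a separate image.
        crop = image[y:y + h, x:x + w]
        cv2.imwrite(os.path.join(out_dir, f"crop_{count:04d}.png"), crop)
        count += 1

    return count
```

Applied to each scanned page of such a form, a routine along these lines can extract every filled-in rectangle as a separate sample, which is how thousands of images could be produced from a modest number of scanned pages.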
