Processing Sampled Big Data

Big data processing requires extremely powerful and large computing infrastructure. This creates a bottleneck not only for the processing infrastructure itself; it also denies many researchers the freedom to analyze large datasets. This paper therefore analyzes the processing of large amounts of data using machine learning models built on smaller samples of the data. The work analyzes more than 40 GB of data, testing different strategies for reducing the amount of processed data without compromising detection and model learning. Many alternatives are analyzed, and it is observed that a 50% reduction does not drastically harm machine learning model performance. On average, when only 50% of the data is used, performance drops by only 3.6% for SVM and only 1.8% for Random Forest. A 50% reduction in instances means that, in most cases, the data will fit in RAM and processing times will be considerably reduced, saving execution time, resources, or both. The incremental training and testing experiments show that, in special cases, smaller sub-sampled data can be used for model generation in machine learning problems. This is useful where hardware is limited or where one has to select among many available machine learning algorithms.
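
The kind of incremental experiment the abstract describes, training on growing fractions of the data and comparing accuracy on a fixed test set, might be sketched as follows. This is only an illustration under stated assumptions: the synthetic two-class data, the nearest-centroid classifier (a simple stand-in for the SVM and Random Forest models named in the abstract), and all function names (`make_data`, `fit_centroids`, `subsample_curve`) are hypothetical and not taken from the paper.

```python
import random


def make_data(n, seed=0):
    """Synthetic two-class 2-D points (assumption: stand-in for a real dataset)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label else -2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def fit_centroids(train):
    """Nearest-centroid model: one mean vector per class (not SVM or Random Forest)."""
    sums, counts = {}, {}
    for (x, y), label in train:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def predict(centroids, point):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def accuracy(centroids, test):
    """Fraction of test points classified correctly."""
    hits = sum(predict(centroids, point) == label for point, label in test)
    return hits / len(test)


def subsample_curve(train, test, fractions, seed=0):
    """Train on a random subsample of each size and score on the same test set."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        k = max(1, int(len(train) * frac))
        subset = rng.sample(train, k)  # sampling without replacement
        results[frac] = accuracy(fit_centroids(subset), test)
    return results


data = make_data(4000)
train, test = data[:3000], data[3000:]
curve = subsample_curve(train, test, [0.1, 0.25, 0.5, 1.0])
for frac, acc in curve.items():
    print(f"{frac:.0%} of training data -> accuracy {acc:.3f}")
```

Comparing the 50% row with the 100% row gives the kind of performance gap the abstract reports. On well-separated synthetic data the gap is expected to be small, which says nothing about how a real 40 GB dataset or a different classifier would behave.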
