An “On The Fly” Framework for Efficiently Generating Synthetic Big Data Sets

Collecting, analyzing and gaining insight from large volumes of data is now the norm in an ever-increasing number of industries. Data analytics techniques, such as machine learning, are powerful tools for analyzing these large volumes of data. Synthetic data sets are routinely relied upon to train and develop such data analytics methods for several reasons: to generate larger data sets than are available, to generate diverse data sets, and to preserve anonymity in data sets with sensitive information, among others. Processing, transmitting and storing data are key issues when handling large data sets. This paper presents an “On the fly” framework for generating big synthetic data sets, suitable for these data analytics methods, that is both computationally efficient and applicable to a diverse set of problems. An example application of the proposed framework is presented, along with a mathematical analysis of its computational efficiency, to demonstrate its effectiveness. Empirical results indicate that the proposed data generation framework reduces computational time by approximately 33% compared with the alternative approach of generating the data set in full.
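
The contrast between generating a data set in full and generating it on the fly can be illustrated with a minimal Python sketch. This is not the framework proposed in the paper, whose details are not given in this abstract; it only shows the general idea of producing synthetic samples in batches as they are consumed, so that the full data set is never held in memory. The function names (`synthetic_batches`, `streaming_mean`, `full_mean`), the Gaussian sample model, and the batch size are all assumptions made for illustration.

```python
import random
from typing import Iterator, List


def synthetic_batches(n_samples: int, batch_size: int, seed: int = 0) -> Iterator[List[float]]:
    """Yield synthetic samples in batches instead of building the whole data set.

    Hypothetical example: samples are drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    remaining = n_samples
    while remaining > 0:
        size = min(batch_size, remaining)
        yield [rng.gauss(0.0, 1.0) for _ in range(size)]
        remaining -= size


def streaming_mean(n_samples: int, batch_size: int, seed: int = 0) -> float:
    """Consume batches as they are generated; only one batch is in memory at a time."""
    total = 0.0
    count = 0
    for batch in synthetic_batches(n_samples, batch_size, seed):
        total += sum(batch)
        count += len(batch)
    return total / count


def full_mean(n_samples: int, seed: int = 0) -> float:
    """Baseline: generate the entire data set first, then analyze it."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(data) / len(data)


if __name__ == "__main__":
    # Both approaches see the same samples because they share the seed.
    print(streaming_mean(10_000, batch_size=256, seed=42))
    print(full_mean(10_000, seed=42))
```

Both functions draw the same sequence of samples for a given seed, so they agree up to floating-point rounding; the streaming version simply avoids materializing all samples at once. This sketch makes no claim about run time and does not reproduce the 33% figure reported above.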
