Diftong: a tool for validating big data workflows

Data validation is the process of verifying that data is correct. When organisations update and refine their data transformations to meet evolving requirements, they must ensure that the new version of a workflow still produces correct output. We motivate the need for workflows and describe the implementation of a validation tool called Diftong. The tool compares two tabular databases produced by different versions of a workflow in order to detect and prevent unwanted alterations. Row-based and column-based statistics quantify the results of the comparison. In test scenarios Diftong provided accurate results, which benefits companies that need to validate the outputs of their workflows. Automating the comparison also removes the risk of human error inherent in manual validation and, compared with that more labour-intensive alternative, shortens the turnaround time. Together, these properties allow for a more agile way of updating data transformation workflows.
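To make the idea of row-based and column-based comparison statistics concrete, the following is a minimal sketch and not Diftong's actual implementation. It assumes each table fits in memory as a list of tuples and that the first column is a unique key; the function names `row_stats` and `column_stats` and the sample data are hypothetical.

```python
from collections import Counter


def row_stats(old_rows, new_rows):
    """Row-based view: treat each table as a multiset of whole rows."""
    old_counts, new_counts = Counter(old_rows), Counter(new_rows)
    return {
        # Rows that appear unchanged in both versions.
        "identical": sum((old_counts & new_counts).values()),
        # Rows present only in the old or only in the new output.
        "only_in_old": sum((old_counts - new_counts).values()),
        "only_in_new": sum((new_counts - old_counts).values()),
    }


def column_stats(old_rows, new_rows, columns, key_index=0):
    """Column-based view: for rows sharing a key, count changed values per column.

    Assumes the key column is unique within each table (an assumption of
    this sketch, not something stated in the abstract).
    """
    old_by_key = {row[key_index]: row for row in old_rows}
    new_by_key = {row[key_index]: row for row in new_rows}
    diffs = {name: 0 for name in columns}
    for key in old_by_key.keys() & new_by_key.keys():
        for i, name in enumerate(columns):
            if old_by_key[key][i] != new_by_key[key][i]:
                diffs[name] += 1
    return diffs


# Hypothetical outputs of two versions of the same workflow.
columns = ("id", "city", "amount")
old = [(1, "Oslo", 10), (2, "Lund", 20), (3, "Umea", 30)]
new = [(1, "Oslo", 10), (2, "Lund", 25), (4, "Kiel", 40)]

print(row_stats(old, new))
print(column_stats(old, new, columns))
```

In this example the row-based statistics show how many rows differ between the two outputs, while the column-based statistics point to the `amount` column as the only one whose values changed for matching keys. A tool aimed at big data would compute such statistics inside a distributed query engine rather than in memory.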
