An Automated Big Data Accuracy Assessment Tool

Data analysis is the most important aspect of any business as it is critical to decision-making. Data quality assessment is a necessary function to be performed before data analysis, as the quality of data has high impact on the outcome of the analysis. Data quality is a multi-dimensional factor that affects the analysis in numerous ways. Among all the dimensions, accuracy is the most important and hardest dimension to assess. With the advent of big data, this problem becomes more complicated. There are only few studies that focus on data accuracy with minimal domain expert dependency. In this paper, we propose an extensive data accuracy assessment tool that uses machine learning to determine the accuracy of data. In addition, our model also addresses the intrinsic and contextual categories of data accuracy. Our model was developed on Apache Spark which serves as the big data environment for handling large datasets.