Data Cleaning Optimization for Grain Big Data Processing using Task Merging

Data quality strongly influences the applications of grain big data, so data cleaning is a necessary and important task. In the MapReduce framework, data cleaning can be executed in parallel with high scalability, but without an effective design the cleaning process contains a large amount of redundant computation, which lowers performance. In this research, we found that during data cleaning some tasks are often carried out multiple times on the same input files, or require the same intermediate results. To address this problem, we propose a new optimization technique based on task merging. By merging simple or redundant computations on the same input files, the number of MapReduce rounds can be greatly reduced. We apply the optimization to several data cleaning modules: entity identification, inconsistent data restoration, and missing value filling. Experimental results show that the overall system runtime is significantly reduced, which indicates that the proposed method improves the efficiency of grain big data cleaning.
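
The idea of merging tasks that share an input can be illustrated with a small single-machine sketch. This is not the paper's implementation: the record fields, the three map functions, and the helper names `run_separately` and `run_merged` are all hypothetical, and the MapReduce shuffle is simulated with in-memory dictionaries. The sketch only shows that several cleaning tasks reading the same file can share one scan while producing the same per-task output.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
MapFn = Callable[[Record], Iterable[Tuple[str, str]]]

# Hypothetical map functions, one per cleaning module named in the abstract.
def entity_key(rec: Record) -> Iterable[Tuple[str, str]]:
    # Entity identification: group record ids by a normalized name.
    yield rec["name"].strip().lower(), rec["id"]

def missing_fields(rec: Record) -> Iterable[Tuple[str, str]]:
    # Missing value filling: report which fields are empty in which record.
    for field, value in rec.items():
        if value == "":
            yield field, rec["id"]

def unit_by_region(rec: Record) -> Iterable[Tuple[str, str]]:
    # Inconsistency detection: collect the units used within each region.
    if rec["unit"]:
        yield rec["region"], rec["unit"]

def run_separately(records: List[Record], tasks: Dict[str, MapFn]):
    """Baseline: every task scans the whole input on its own."""
    outputs, scans = {}, 0
    for name, fn in tasks.items():
        scans += 1  # one full pass over the input per task
        grouped = defaultdict(list)
        for rec in records:
            for key, value in fn(rec):
                grouped[key].append(value)
        outputs[name] = dict(grouped)
    return outputs, scans

def run_merged(records: List[Record], tasks: Dict[str, MapFn]):
    """Merged: one pass over the input, with outputs kept apart per task."""
    grouped = {name: defaultdict(list) for name in tasks}
    for rec in records:  # the single shared scan
        for name, fn in tasks.items():
            for key, value in fn(rec):
                grouped[name][key].append(value)
    return {name: dict(g) for name, g in grouped.items()}, 1

# Made-up sample records, for illustration only.
RECORDS = [
    {"id": "1", "name": "Wheat ", "region": "north", "unit": "ton"},
    {"id": "2", "name": "wheat", "region": "north", "unit": "kg"},
    {"id": "3", "name": "Rice", "region": "south", "unit": ""},
]
TASKS = {"entity": entity_key, "missing": missing_fields, "units": unit_by_region}

separate_out, separate_scans = run_separately(RECORDS, TASKS)
merged_out, merged_scans = run_merged(RECORDS, TASKS)
```

In this sketch both strategies return identical per-task results, but the merged version reads the input once instead of once per task. In a real MapReduce job the same effect would typically be obtained by tagging each emitted key with its task name so that a single job can serve several cleaning modules.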
