A Data Cleaning Service on Massive Spatio-Temporal Data in Highway Domain

With the development of highway toll systems and sensor networks, massive volumes of highway toll data have accumulated. Imperfections in the raw data, such as incomplete, duplicate, and abnormal records, seriously reduce the efficiency of data mining and modeling. Traditional cleaning methods are inefficient on massive spatio-temporal data because the business rules of each domain are difficult to express. Using the highway toll data of Henan Province, we propose a data cleaning service driven by business rules. The service efficiently cleans raw toll data with spatio-temporal attributes: it validates erroneous and invalid data, repairs erroneous data, and filters duplicate data. Implemented with Hadoop MapReduce on highway toll data, the service demonstrates efficiency, accuracy, and scalability in extensive experiments.
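
The three cleaning steps named in the abstract (validation, repair, duplicate filtering) can be illustrated with a minimal map/reduce-style sketch in plain Python. This is not the authors' implementation: the record fields (`vehicle_id`, `entry_station`, `exit_station`, `entry_time`, `exit_time`, `fee`), the `FEE_TABLE` lookup, and the specific rules are hypothetical stand-ins for the business rules, and the in-memory grouping only imitates what the MapReduce shuffle would do.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical master data: expected fee per (entry_station, exit_station) route.
FEE_TABLE = {("A", "B"): 25.0, ("A", "C"): 40.0}

# Hypothetical required fields and timestamp layout of a toll record.
REQUIRED = ("vehicle_id", "entry_station", "exit_station", "entry_time", "exit_time")
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_valid(record):
    """Validation rule (assumed): required fields present, exit strictly after entry."""
    if any(not record.get(field) for field in REQUIRED):
        return False
    try:
        entry = datetime.strptime(record["entry_time"], TIME_FORMAT)
        exit_ = datetime.strptime(record["exit_time"], TIME_FORMAT)
    except ValueError:
        return False
    return exit_ > entry


def repair(record):
    """Repair rule (assumed): fill a missing or negative fee from the route table."""
    fixed = dict(record)
    fee = fixed.get("fee")
    route = (fixed["entry_station"], fixed["exit_station"])
    if (fee is None or fee < 0) and route in FEE_TABLE:
        fixed["fee"] = FEE_TABLE[route]
    return fixed


def map_phase(records):
    """Drop invalid records, repair the rest, and emit them under a dedup key."""
    for record in records:
        if not is_valid(record):
            continue
        fixed = repair(record)
        key = (fixed["vehicle_id"], fixed["entry_station"], fixed["entry_time"])
        yield key, fixed


def reduce_phase(pairs):
    """Group by key (standing in for the shuffle) and keep one record per key."""
    grouped = defaultdict(list)
    for key, record in pairs:
        grouped[key].append(record)
    return [group[0] for group in grouped.values()]


def clean(records):
    """Run the map and reduce phases over an in-memory list of records."""
    return reduce_phase(map_phase(records))
```

In an actual Hadoop job the mapper and reducer would be separate tasks and the framework would perform the grouping by key; the sketch keeps everything in one process only so that the rule logic is easy to follow.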
