Autonomous Sensor Data Cleaning in Stream Mining Setting

Abstract. Background: The Internet of Things (IoT), earth observation and large scientific experiments are today's sources of extensive amounts of sensor data. Because individual measurements are cheap, the volume of data is large. A standard approach in such cases is stream mining, in which each measurement is examined only once during real-time processing. This requires the methods to be completely autonomous. In the past, very little attention was given to the most time-consuming part of the data mining process, i.e. data pre-processing. Objectives: In this paper we propose an algorithm for data cleaning that can be applied to real-world streaming big data. Methods/Approach: We use a short-term prediction method based on the Kalman filter to determine admissible intervals for future measurements. The model can adapt to concept drift and is useful for detecting random additive outliers in a sensor data stream. Results: For datasets with low noise, our method performed better than the method currently in common use in batch processing scenarios. On datasets with higher noise, the results of the two methods are comparable. Conclusions: We have demonstrated a successful application of the proposed method in real-world scenarios, including groundwater level, server load and smart-grid data.
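The idea in Methods/Approach can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the state-space model, so the sketch assumes a scalar random-walk model, and the class name `KalmanCleaner`, the helper `clean_stream`, the noise variances `q` and `r`, and the interval width `k` are all hypothetical choices made for illustration. Each measurement is seen once; the filter predicts an admissible interval for it, flags the measurement as an outlier if it falls outside, and otherwise updates its state.

```python
import math


class KalmanCleaner:
    """Scalar random-walk Kalman filter (an assumed model, not the paper's)."""

    def __init__(self, x0=0.0, p0=1.0, q=1e-3, r=1e-2, k=3.0):
        self.x = x0  # current state estimate
        self.p = p0  # variance of the state estimate
        self.q = q   # process noise variance (assumed value)
        self.r = r   # measurement noise variance (assumed value)
        self.k = k   # interval half-width in predicted standard deviations

    def admissible_interval(self):
        """Interval in which the next measurement is expected to fall."""
        p_pred = self.p + self.q
        half = self.k * math.sqrt(p_pred + self.r)
        return self.x - half, self.x + half

    def process(self, z):
        """Handle one measurement; return (cleaned_value, is_outlier)."""
        low, high = self.admissible_interval()
        p_pred = self.p + self.q
        if not (low <= z <= high):
            # Outlier: skip the update and return the prediction instead.
            # The variance still grows, so the interval widens over time
            # and a persistent level shift is eventually accepted.
            self.p = p_pred
            return self.x, True
        gain = p_pred / (p_pred + self.r)
        self.x = self.x + gain * (z - self.x)
        self.p = (1.0 - gain) * p_pred
        return z, False


def clean_stream(values, **kwargs):
    """Run the cleaner over an iterable, one pass, one value at a time."""
    cleaner = KalmanCleaner(**kwargs)
    return [cleaner.process(z) for z in values]


if __name__ == "__main__":
    stream = [10.0, 10.1, 9.9, 10.05, 25.0, 10.0, 9.95]
    for value, (cleaned, flagged) in zip(stream, clean_stream(stream, x0=10.0)):
        print(value, round(cleaned, 3), flagged)
```

In this sketch the spike of 25.0 is flagged and replaced by the filter's prediction, while the surrounding values pass through unchanged. Replacing a flagged value with the prediction is one possible design choice; dropping it or marking it as missing would fit the same scheme.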
