Data Preprocessing Techniques

Raw industrial data collected on-site by a supervisory control and data acquisition (SCADA) system can rarely be used directly to construct a prediction model, because such data are typically contaminated by high-level noise, missing points, and outliers arising from real-time database malfunctions, data transformation, or maintenance. Data preprocessing is therefore required, and it usually comprises anomaly detection, data imputation, and de-noising. For outliers, this chapter introduces anomaly detection methods based on the fuzzy C-means (FCM), K-nearest-neighbor (KNN), and dynamic time warping (DTW) algorithms. For missing data points, a series of data imputation methods is described: after introducing the generic regression filling and expectation-maximization (EM) methods, we present a varied window similarity measure method, a segmented shape-representation-based method, and a non-equal-length granules correlation method for industrial data imputation. For the high-level noise in raw data, we introduce the well-known empirical mode decomposition (EMD) method. To verify the effectiveness of these methods, the chapter also provides a number of industrial case studies.
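As a rough illustration of the detect-then-impute workflow summarized above, the following sketch flags outliers with a simple KNN distance score and then fills flagged and missing points by linear interpolation. It is not the chapter's own method: the function names, the median-based threshold rule, the parameters `k` and `threshold`, and the sample readings are all assumptions made for illustration, and plain linear interpolation stands in for the more elaborate imputation methods the chapter describes.

```python
def knn_outlier_flags(values, k=3, threshold=3.0):
    """Flag points whose mean distance to their k nearest values is unusually large.

    Assumed rule (illustrative only): a point is an outlier when its KNN score
    exceeds `threshold` times the median score of all observed points.
    Missing points are represented by None and are never flagged here.
    """
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    scores = {}
    for i, v in observed:
        # Distances in value space to every other observed point.
        dists = sorted(abs(v - w) for j, w in observed if j != i)
        scores[i] = sum(dists[:k]) / k
    ordered = sorted(scores.values())
    median = ordered[len(ordered) // 2]
    flags = [False] * len(values)
    for i, score in scores.items():
        flags[i] = score > threshold * max(median, 1e-12)
    return flags


def interpolate_gaps(values, flags):
    """Fill missing (None) and flagged points by linear interpolation in time."""
    clean = [None if (v is None or f) else v for v, f in zip(values, flags)]
    known = [i for i, v in enumerate(clean) if v is not None]
    if not known:
        raise ValueError("no valid points to interpolate from")
    filled = list(clean)
    for i, v in enumerate(clean):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:        # gap at the start: copy the first valid value
            filled[i] = clean[right]
        elif right is None:     # gap at the end: copy the last valid value
            filled[i] = clean[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = clean[left] + weight * (clean[right] - clean[left])
    return filled


# Hypothetical flow readings with one missing point and one spike.
readings = [10.0, 10.2, 9.9, None, 10.1, 50.0, 10.3, 10.0]
flags = knn_outlier_flags(readings)
repaired = interpolate_gaps(readings, flags)
```

Scoring in value space ignores temporal ordering, so on data with trends or multiple operating regimes a windowed or shape-based comparison would likely be more appropriate.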
