Paper Information - Data Preprocessing Techniques

Data Preprocessing Techniques

Raw industrial data accumulated on-site by commonly deployed supervisory control and data acquisition (SCADA) systems can rarely be used directly to construct a prediction model, because such data are typically contaminated by high-level noise, missing points, and outliers arising from real-time database malfunctions, data transformation, or maintenance. Data preprocessing techniques therefore have to be applied, usually comprising anomaly detection, data imputation, and de-noising. For outliers, this chapter introduces anomaly detection methods based on the fuzzy C-means (FCM), K-nearest-neighbor (KNN), and dynamic time warping (DTW) algorithms. For missing data points, a series of data imputation methods is described: after introducing the generic regression filling and expectation maximization (EM) methods, we present a varied window similarity measure method, a segmented shape-representation-based method, and a non-equal-length granules correlation method for industrial data imputation. For the high-level noise in raw data, we introduce the well-known empirical mode decomposition (EMD) method. To verify the effectiveness of these methods, the chapter also provides a number of industrial case studies.
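As a rough illustration of the KNN-based anomaly detection idea mentioned above, the following minimal sketch scores each sample by its mean distance to its k nearest neighbours and flags samples whose score is unusually large. It is not the chapter's own algorithm: the function names, the sample readings, and the "factor times the median score" threshold rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def knn_anomaly_scores(values: Sequence[float], k: int = 3) -> List[float]:
    """Score each point by its mean absolute distance to its k nearest neighbours."""
    n = len(values)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(values)")
    scores = []
    for i, v in enumerate(values):
        # Distances from point i to every other point, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_anomalies(values: Sequence[float], k: int = 3, factor: float = 3.0) -> List[bool]:
    """Flag points whose score exceeds `factor` times the median score.

    The threshold rule is a hypothetical choice for this sketch, not one
    taken from the chapter.
    """
    scores = knn_anomaly_scores(values, k)
    ordered = sorted(scores)
    median = ordered[len(ordered) // 2]  # upper median for even-length input
    return [s > factor * median for s in scores]


# Hypothetical flow readings with one injected spike at index 4.
readings = [10.1, 10.3, 9.9, 10.0, 25.0, 10.2, 9.8, 10.1]
flags = flag_anomalies(readings, k=3)
```

With these assumed readings, only the spike at index 4 is flagged. A real SCADA series would usually need the distance computed over time windows or multivariate features rather than single values, and a threshold tuned to the process.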

Jun Zhao | Wei Wang | Chunyang Sheng

[1] Andrew Gelman, et al. Data Analysis Using Regression and Multilevel/Hierarchical Models, 2006.

[2] Mohamed Medhat Gaber, et al. Knowledge discovery from data streams, 2009, IDA 2009.

[3] Harald Haas, et al. Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication, 2004, Science.

[4] Hiroshi Motoda, et al. Feature Extraction, Construction and Selection: A Data Mining Perspective, 1998.

[5] Dorian Pyle, et al. Data Preparation for Data Mining, 1999.

[6] Gustavo E. A. P. A. Batista, et al. A Study of K-Nearest Neighbour as an Imputation Method, 2002, HIS.

[7] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[8] Ahmet Arslan, et al. A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm, 2013, Inf. Sci.

[9] Laurent Itti, et al. An Integrated Model of Top-Down and Bottom-Up Attention for Optimizing Detection Speed, 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10] D.P. Mandic, et al. Multi-step forecasting using echo state networks, 2005, EUROCON 2005 - The International Conference on "Computer as a Tool".

[11] J. Chiang, et al. A new kernel-based fuzzy clustering approach: support vector clustering with cell growing, 2003, IEEE Trans. Fuzzy Syst.

[12] George Karypis, et al. A Comparison of Document Clustering Techniques, 2000.

[13] Tak-Chung Fu, et al. A review on time series data mining, 2011, Eng. Appl. Artif. Intell.

[14] Padhraic Smyth, et al. From Data Mining to Knowledge Discovery in Databases, 1996, AI Mag.

[15] Jean-Marc Adamo, et al. Data Mining for Association Rules and Sequential Patterns, 2000, Springer New York.

[16] Roderick J. A. Little, et al. Statistical Analysis with Missing Data, 2002.

[17] Christofer Toumazou, et al. Empirical Mode Decomposition: Real-Time Implementation and Applications, 2013, J. Signal Process. Syst.

[18] Jesús S. Aguilar-Ruiz, et al. Knowledge discovery from data streams, 2009, Intell. Data Anal.

[19] Gabriel Rilling, et al. On empirical mode decomposition and its algorithms, 2003.

[20] Han Min, et al. Ridge regression learning in ESN for chaotic time series prediction, 2007.

[21] Ge Yu, et al. FSMBO: Fast Time Series Similarity Matching Based on Bit Operation, 2008, 2008 The 9th International Conference for Young Computer Scientists.

[22] Kai Liu, et al. Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry, 2014, Inf. Sci.

[23] Chun-Chin Hsu, et al. An information granulation based data mining approach for classifying imbalanced data, 2008, Inf. Sci.

[24] Wei Wang, et al. Data imputation for gas flow data in steel industry based on non-equal-length granules correlation coefficient, 2016, Inf. Sci.

[25] James C. Bezdek, et al. On cluster validity for the fuzzy c-means model, 1995, IEEE Trans. Fuzzy Syst.

[26] Dimitris Kanellopoulos, et al. Data Preprocessing for Supervised Leaning, 2007.

[27] Hiroshi Motoda, et al. Feature Extraction, Construction and Selection, 1998.

[28] James C. Bezdek, et al. Pattern Recognition with Fuzzy Objective Function Algorithms, 1981, Advanced Applications in Pattern Recognition.

[29] E. Miller, et al. Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and Posterior Parietal Cortices, 2007, Science.

[30] Ethem Alpaydin, et al. Introduction to machine learning, 2004, Adaptive computation and machine learning.

[31] Richard J. Povinelli, et al. Time series outlier detection and imputation, 2014, 2014 IEEE PES General Meeting | Conference & Exposition.

[32] D. Rubin, et al. Statistical Analysis with Missing Data, 1989.