Local Similarity Imputation Based on Fast Clustering for Incomplete Data in Cyber-Physical Systems

Missing values are common in cyber-physical systems (CPS) for a variety of reasons, such as sensor faults, communication malfunctions, environmental interference, and human error. Accurate missing value imputation is crucial for improving data quality in data mining and statistical analysis tasks. Unfortunately, most existing methods use the whole dataset to impute a missing value, so irrelevant records can degrade the imputed results (low accuracy or high complexity). To address this problem, this paper develops a novel local similarity imputation method that estimates missing data based on fast clustering and top $k$-nearest neighbors. To improve the imputation accuracy, a two-layer stacked autoencoder combined with distinctive imputation is applied to locate the principal features of a dataset for clustering. Then, a top $k$-nearest neighbor hybrid distance weighted imputation is applied to fill in missing values within clusters. The proposed method is evaluated on five popular University of California Irvine datasets as well as one air quality monitoring dataset collected from CPS, in comparison with four high-quality existing imputation methods. Empirical results show that the proposed scheme can impute missing data values effectively and efficiently, especially for incomplete data with local characteristics in CPS.
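
The general idea of cluster-local imputation can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes column-mean pre-filling in place of the paper's distinctive imputation, plain k-means on the raw attributes in place of the stacked autoencoder and fast clustering step, and simple inverse Euclidean distance weights in place of the hybrid distance. The names `kmeans`, `local_knn_impute`, `n_clusters`, `k`, and `eps` are hypothetical.

```python
import numpy as np


def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain k-means (stand-in for the paper's fast clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every record to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def local_knn_impute(X, n_clusters=2, k=3, eps=1e-9):
    """Fill NaNs using the k nearest records from the same cluster only."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    # Assumption: mean pre-fill so that incomplete records can be clustered.
    filled = np.where(missing, col_means, X)
    labels = kmeans(filled, n_clusters)

    result = X.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        obs = ~missing[i]  # attributes observed in record i
        same = np.flatnonzero(labels == labels[i])
        same = same[same != i]
        for j in np.flatnonzero(missing[i]):
            # Donors: same-cluster records that actually observe attribute j.
            donors = same[~missing[same, j]]
            if donors.size == 0:
                result[i, j] = col_means[j]  # fallback: global column mean
                continue
            # Distance measured only over the attributes record i observes.
            diff = filled[donors][:, obs] - X[i, obs]
            dist = np.sqrt((diff ** 2).sum(axis=1))
            top = np.argsort(dist)[:k]
            w = 1.0 / (dist[top] + eps)  # inverse-distance weights
            result[i, j] = np.dot(w, X[donors[top], j]) / w.sum()
    return result


# Toy data: two well-separated groups, one missing reading in the first group.
X_demo = np.array([
    [1.0, 1.2, 0.9],
    [1.1, 0.9, 1.0],
    [0.9, 1.0, np.nan],
    [1.2, 1.1, 1.1],
    [10.0, 10.2, 9.9],
    [10.1, 9.9, 10.0],
    [9.9, 10.0, 10.1],
])
imputed = local_knn_impute(X_demo, n_clusters=2, k=2)
print(imputed[2])
```

In this toy example the missing reading is estimated only from records in its own cluster, so the distant group around 10 does not pull the estimate upward, which is the effect the abstract attributes to excluding irrelevant records.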
