Missing Data Imputation for Geolocation-based Price Prediction Using KNN-MCF Method

Accurate house price forecasts are important for formulating national economic policy. In this paper, we propose a method for predicting house sale prices. The pipeline has three steps: one-hot encoding to convert text data into numeric data, correlation-based feature selection to retain only the most correlated variables, and imputation of missing values. For the imputation step we use a K-nearest neighbor algorithm based on the most correlated features (KNN-MCF), which handles missing data effectively in large datasets. To the best of our knowledge, no previous research has focused on the most important features when dealing with missing observations. With the random forest algorithm, the proposed method achieves a prediction accuracy of 92.01%, outperforming the typical machine learning prediction algorithms it was compared against.
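
The idea of KNN imputation restricted to the most correlated features can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that features are ranked by absolute Pearson correlation with the column being imputed, that distances are Euclidean, and that a missing value is replaced by the mean of its k nearest complete rows. The function names, the toy columns (area, noise, price), and the values are all hypothetical, and one-hot encoding is omitted for brevity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def most_correlated(rows, target, n_features):
    """Indices of the n_features columns most correlated with column `target`."""
    cols = list(zip(*rows))
    scores = [
        (abs(pearson(cols[j], cols[target])), j)
        for j in range(len(cols))
        if j != target
    ]
    scores.sort(reverse=True)
    return [j for _, j in scores[:n_features]]


def knn_mcf_impute(rows, target, k=2, n_features=1):
    """Fill missing values (None) in column `target` using the mean of the
    k nearest donor rows, measuring distance only on the selected features.
    Assumption: this is one plausible reading of KNN-MCF, not the paper's code."""
    feats = most_correlated(rows, target, n_features)
    donors = [
        r for r in rows
        if r[target] is not None and all(r[j] is not None for j in feats)
    ]
    filled = [list(r) for r in rows]  # copy so the input is left untouched
    for r in filled:
        if r[target] is None and all(r[j] is not None for j in feats):
            point = [r[j] for j in feats]
            nearest = sorted(
                donors, key=lambda d: math.dist(point, [d[j] for j in feats])
            )[:k]
            if nearest:
                r[target] = sum(d[target] for d in nearest) / len(nearest)
    return filled


# Hypothetical toy data: columns are area, noise level, price.
rows = [
    [50, 3, 100],
    [60, 9, 120],
    [70, 1, 140],
    [80, 7, 160],
    [65, 5, None],  # price is missing
]
filled = knn_mcf_impute(rows, target=2, k=2, n_features=1)
print(filled[4])
```

In this toy data the price tracks the area column far more closely than the noise column, so only area is used to find neighbors. Restricting the distance to such features keeps weakly related columns from distorting the neighbor search, which is the motivation the abstract gives for KNN-MCF.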
