Semi-parametric optimization for missing data imputation

Missing data imputation is an important problem in machine learning and data mining. In this paper, we propose a new and efficient imputation method for one kind of missing data: semi-parametric data. The method aims at optimal performance with respect to Root Mean Square Error (RMSE), the distribution function, and quantiles after the missing data are imputed. We evaluate our approach experimentally on both simulated and real data, and demonstrate that our stochastic semi-parametric regression imputation outperforms existing deterministic semi-parametric regression imputation in both efficiency and effectiveness.
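To make the contrast between deterministic and stochastic regression imputation concrete, the following is a minimal sketch and not the paper's own algorithm. It assumes a partially linear model y = beta*x + g(t) + eps, which the abstract does not specify, fits it with a Robinson-style kernel estimator, and fills missing responses either with the fitted value alone (deterministic) or with the fitted value plus a residual resampled from the observed cases (stochastic). The function names `nw_smooth` and `impute`, the Gaussian kernel, the bandwidth `h`, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def nw_smooth(t_train, v_train, t_eval, h):
    """Nadaraya-Watson kernel regression with a Gaussian kernel (assumed choice)."""
    d = (t_eval[:, None] - t_train[None, :]) / h
    w = np.exp(-0.5 * d**2)
    return (w @ v_train) / w.sum(axis=1)

def impute(x, t, y, h=0.1, stochastic=True, rng=None):
    """Fill NaNs in y under the assumed model y = beta*x + g(t) + eps."""
    rng = np.random.default_rng() if rng is None else rng
    obs = ~np.isnan(y)
    xo, to, yo = x[obs], t[obs], y[obs]
    # Robinson-style step: remove the smooth effect of t from x and y,
    # then estimate beta by least squares on the two residual series.
    x_res = xo - nw_smooth(to, xo, to, h)
    y_res = yo - nw_smooth(to, yo, to, h)
    beta = (x_res @ y_res) / (x_res @ x_res)
    # Estimate g by smoothing the partial residuals y - beta*x on t.
    partial = yo - beta * xo
    resid = partial - nw_smooth(to, partial, to, h)
    filled = y.copy()
    # Deterministic part: fitted value beta*x + g_hat(t) for each missing case.
    filled[~obs] = beta * x[~obs] + nw_smooth(to, partial, t[~obs], h)
    if stochastic:
        # Stochastic part: add a residual resampled from the observed cases.
        filled[~obs] += rng.choice(resid, size=(~obs).sum(), replace=True)
    return filled, beta

# Simulated data (hypothetical): true beta = 2, g(t) = sin(2*pi*t).
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
t = rng.uniform(size=n)
y = 2.0 * x + np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=n)
y_miss = y.copy()
y_miss[rng.random(n) < 0.3] = np.nan  # responses missing completely at random

det, beta_hat = impute(x, t, y_miss, stochastic=False)
sto, _ = impute(x, t, y_miss, stochastic=True, rng=rng)
```

In this sketch the deterministic fill places every imputed value exactly on the fitted surface, which tends to understate the spread of the response. The stochastic fill restores residual-scale variation, which is one plausible reason such a method could do better on distribution-function and quantile estimates.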
