Empirical likelihood confidence intervals for differences between two datasets with missing data

Detecting differences between populations (or datasets) is an important research topic in machine learning and a common means of evaluation, for example assessing a new medical product by comparing it with an old one. Previous research has focused on change detection. In this paper, we measure the uncertainty of structural differences between populations, such as differences in means and distribution functions, using confidence intervals (CIs) obtained via an empirical likelihood approach. We present a statistically sound method for estimating CIs for differences between non-parametric populations with missing values, which are imputed using simple random hot deck imputation. We illustrate the power of CI estimation as a new machine learning technique on tasks such as distinguishing spam from non-spam emails in the Spambase dataset downloaded from UCI.
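
As an illustration of the imputation step only, the following is a minimal sketch of simple random hot deck imputation, assuming missing entries are encoded as None and that each missing entry is filled by drawing, with replacement, from the observed values of the same sample. The function names, the toy data, and the point estimate of the mean difference are hypothetical and are not taken from the paper; the empirical likelihood CI construction itself is not shown.

```python
import random


def random_hot_deck_impute(values, rng=None):
    """Fill each missing entry (None) with a donor value drawn at random,
    with replacement, from the observed entries of the same sample."""
    rng = rng or random.Random()
    donors = [v for v in values if v is not None]
    if not donors:
        raise ValueError("no observed values to use as donors")
    return [v if v is not None else rng.choice(donors) for v in values]


def mean_difference(x, y):
    """Point estimate of the difference between two sample means."""
    return sum(x) / len(x) - sum(y) / len(y)


# Hypothetical toy samples; None marks a missing response.
rng = random.Random(0)
old = [5.1, None, 4.8, 5.5, None, 5.0]
new = [5.9, 6.1, None, 5.7, 6.3, None]

old_filled = random_hot_deck_impute(old, rng)
new_filled = random_hot_deck_impute(new, rng)
diff = mean_difference(new_filled, old_filled)
print(diff)
```

Because the imputed values are random draws from the observed donors, the completed samples are not independent observations, which is why an interval built naively from them would understate the uncertainty and a dedicated CI method is needed.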
