Prediction of a Function of Misclassified Binary Data

We consider the problem of predicting a function of misclassified binary variables. We observe that the naive predictor, which ignores the misclassification errors, is unbiased even when the total misclassification error is high, as long as the probabilities of false positives and false negatives are identical. In all other cases, the bias of the naive predictor depends on the misclassification distribution, and its magnitude can be high. We correct the bias of the naive predictor using a double sampling idea in which both inaccurate and accurate measurements of the binary variable are taken for all units of a sample drawn from the original data using a probability sampling scheme. Using this additional information and design-based sample survey theory, we derive a bias-corrected predictor. We examine the cases where the new bias-corrected predictor can also improve over the naive predictor in terms of mean square error (MSE).
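The double sampling idea can be illustrated with a minimal sketch. It is not the predictor derived in the paper. It assumes the target function is the population total of the true binary variable, that the validation sample is a simple random sample without replacement, and that a standard design-based difference estimator is used for the correction. All names (`naive_total`, `bias_corrected_total`, `simulate`) and all parameter values are hypothetical, and `fp` and `fn` are taken here to be conditional misclassification rates.

```python
import random


def naive_total(y_star):
    """Naive predictor of the total: sums the error-prone measurements."""
    return sum(y_star)


def bias_corrected_total(y_star, validated):
    """Difference-estimator correction (an assumed form, not the paper's).

    y_star:    error-prone 0/1 measurements for all N units.
    validated: dict mapping unit index -> accurate 0/1 value for the
               units in the simple random validation sample.
    """
    N = len(y_star)
    n = len(validated)
    # Mean classification error (observed minus true) in the sample.
    mean_error = sum(y_star[i] - y for i, y in validated.items()) / n
    # Subtract the estimated total error from the naive total.
    return naive_total(y_star) - N * mean_error


def simulate(N=2000, n=200, prevalence=0.3, fp=0.10, fn=0.02,
             reps=500, seed=0):
    """Compare the naive total with the average corrected total."""
    rng = random.Random(seed)
    # True values, then error-prone versions with unequal error rates.
    y = [1 if rng.random() < prevalence else 0 for _ in range(N)]
    y_star = []
    for yi in y:
        if yi == 1:
            y_star.append(0 if rng.random() < fn else 1)  # false negative
        else:
            y_star.append(1 if rng.random() < fp else 0)  # false positive
    truth = sum(y)
    naive = naive_total(y_star)
    corrected = []
    for _ in range(reps):
        sample = rng.sample(range(N), n)  # SRS without replacement
        corrected.append(
            bias_corrected_total(y_star, {i: y[i] for i in sample}))
    return truth, naive, sum(corrected) / reps


if __name__ == "__main__":
    truth, naive, avg_corrected = simulate()
    print("true total:", truth)
    print("naive total:", naive)
    print("average corrected total:", round(avg_corrected, 1))
```

Under these assumptions the corrected total is design-unbiased for the true total, because the sample mean of the classification errors is unbiased for the population mean error under simple random sampling. The correction adds sampling variance, so it need not beat the naive predictor in MSE when the naive bias is small.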

[1]  P Gustafson,et al.  Case–Control Analysis with Partial Knowledge of Exposure Misclassification Probabilities , 2001, Biometrics.

[2]  Fritz Scheuren,et al.  Regression Analysis of Data Files that Are Computer Matched , 1993 .

[3]  Ying Yang,et al.  Maximum likelihood estimation of a binomial proportion using one‐sample misclassified binary data , 2015 .

[4]  Dean M. Young,et al.  Confidence intervals for a binomial parameter based on binary data subject to false-positive misclassification , 2006, Comput. Stat. Data Anal..

[5]  William E. Winkler,et al.  Data quality and record linkage techniques , 2007 .

[6]  A. Tenenbein A Double Sampling Scheme for Estimating from Binomial Data with Misclassifications , 1970 .

[7]  Michael Evans,et al.  Bayesian Analysis of Binary Data Subject to Misclassification , 1996 .

[8]  P. Ivax,et al.  A THEORY FOR RECORD LINKAGE , 2004 .

[9]  Brent Snook,et al.  Computerized Crime Linkage Systems , 2012 .

[10]  I. Bross Misclassification in 2 X 2 Tables , 1954 .

[11]  Frank Yates,et al.  Selection Without Replacement from Within Strata with Probability Proportional to Size , 1953 .

[12]  P. Lahiri,et al.  Regression Analysis With Linked Data , 2005 .

[13]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[14]  Alan D. Lopez,et al.  Evaluation of record linkage of mortality data between a health and demographic surveillance system and national civil registration system in South Africa , 2014, Population Health Metrics.

[15]  Robert H Lyles,et al.  Design and Analytic Considerations for Single-Armed Studies with Misclassification of a Repeated Binary Outcome , 2004, Journal of biopharmaceutical statistics.

[16]  Robert L. Winkler,et al.  Implications of errors in survey data: a Bayesian model , 1992 .

[17]  G R Howe,et al.  Use of computerized record linkage in cohort studies. , 1998, Epidemiologic reviews.

[18]  Norman E. Breslow,et al.  Multiplicative Models and Cohort Analysis , 1983 .

[19]  M. Viana,et al.  Bayesian analysis of prevalence from the results of small screening samples , 1993 .

[20]  Lalitha Sundaresan,et al.  Validation of de-identified record linkage to ascertain hospital admissions in a cohort study , 2011, BMC medical research methodology.

[21]  Yan D. Zhao,et al.  One-way analysis of proportions for misclassified binomial data , 2013 .

[22]  Judith D. Goldberg,et al.  The Effects of Misclassification on the Bias in the Difference Between Two Proportions and the Relative Odds in the Fourfold Table , 1975 .

[23]  Christophe G. Giraud-Carrier,et al.  Effective record linkage for mining campaign contribution data , 2014, Knowledge and Information Systems.

[24]  John Neter,et al.  The Effect of Mismatching on the Measurement of Response Errors , 1965 .

[25]  D. Young,et al.  Bayesian Estimation of Intervention Effect with Pre- and Post-Misclassified Binomial Data , 2007, Journal of biopharmaceutical statistics.

[26]  Bob Zhong EVALUATING QUALITATIVE ASSAYS USING SENSITIVITY AND SPECIFICITY , 2002, Journal of biopharmaceutical statistics.