1. INTRODUCTION

This paper is my description of the state of statistical data editing and of current research problems. It is not intended to be a complete description of all areas. Rather, it covers a few sub-areas of statistical data editing in sufficient detail that the discussion of a few research problems is more easily understood.

I define statistical data editing (SDE) as those methods that are used to edit (i.e., clean up) and impute (i.e., fill in) missing or contradictory data. The end result of SDE is data that can be used for intended analytic purposes. These include primary purposes such as estimation of totals and subtotals for publications that are free of self-contradictory information. The published totals should also not contradict published totals in other sources. Self-contradictory information might include groups of items that do not add to desired subtotals, or totals for subgroups that exceed a known proportion of the total for the entire group. Other uses of the data after SDE might include preparation of variances of estimates for a number of sub-domains and micro-data analyses.

If only a few published totals need to be accurate, then an efficient use of resources may be to perform detailed edits on only the few records that affect the estimated totals. If many analyses need to be performed on a large number of sub-domains, or if the full set of accurate micro-data is needed, then a very large number of edits, follow-ups, and corrections may be needed.

SDE can be used in all phases of survey processing. These phases include frame development, form design, proposed analytic purposes for which the data are collected, and quality assurance. This paper focuses primarily on SDE as it applies to analytic purposes, and places most emphasis on those procedures typically applied after the initial receipt of survey or other data.
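The two kinds of self-contradictory information mentioned above (items that do not add to a desired subtotal, and a subgroup total that exceeds a known proportion of the overall total) can be written as simple edit checks. The following is only a minimal sketch in Python: the field names (`wages`, `benefits`, `payroll`), the 50% ceiling, and the zero tolerance are hypothetical values chosen for illustration, not edits taken from any production system.

```python
from typing import Dict, List


def balance_edit_fails(record: Dict[str, float], items: List[str],
                       total_field: str, tol: float = 0.0) -> bool:
    """Return True if the listed items do not add to the reported total."""
    return abs(sum(record[f] for f in items) - record[total_field]) > tol


def proportion_edit_fails(record: Dict[str, float], subgroup_field: str,
                          total_field: str, max_share: float) -> bool:
    """Return True if the subgroup exceeds the allowed share of the total."""
    return record[subgroup_field] > max_share * record[total_field]


def failed_edits(record: Dict[str, float]) -> List[str]:
    """Apply both illustrative edits and name the ones the record fails."""
    failures = []
    # Hypothetical balance edit: wages + benefits must equal payroll.
    if balance_edit_fails(record, ["wages", "benefits"], "payroll"):
        failures.append("balance")
    # Hypothetical proportion edit: benefits may not exceed 50% of payroll.
    if proportion_edit_fails(record, "benefits", "payroll", 0.5):
        failures.append("proportion")
    return failures


good = {"wages": 80.0, "benefits": 20.0, "payroll": 100.0}
bad = {"wages": 30.0, "benefits": 60.0, "payroll": 100.0}

print(failed_edits(good))  # []
print(failed_edits(bad))   # ['balance', 'proportion']
```

A record that fails one or more edits would then be a candidate for follow-up or imputation. Deciding which fields to change is the harder problem, and this sketch does not address it.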
The main goal of SDE might be improved procedures and greater automation to enhance the ability of survey managers and analysts to provide accurate published estimates and micro-data. I broadly subdivide statistical data editing into two subcategories: (1) Fellegi-Holt (FH) methods and systems and (2) general methods and systems. FH systems are based on the Fellegi-Holt model of editing and typically add various options for imputation. General methods are all other methods. Although the paper by Fellegi and Holt (1976) appeared quite a while ago, few systems have been implemented because of the difficulty in developing …
[1] Richard Sigman, et al. Statistical Methods for Developing Ratio Edit Tolerances for Economic Data. 1999.
[2] Laurence A. Wolsey, et al. Integer and Combinatorial Optimization. 1988.
[3] N. Chernikova. Algorithm for finding a general formula for the non-negative solutions of a system of linear equations. 1964.
[4] Roderick J. A. Little, et al. Statistical Analysis with Missing Data. 1988.
[5] William E. Winkler, et al. The Discrete Edit System. 1997.
[6] D. Rubin, et al. Statistical Analysis with Missing Data. 1988.
[7] Todd A. Todaro. Evaluation of the Aggies Automated Edit and Imputation System. 1999.
[8] D. Holt, et al. A Systematic Approach to Automatic Edit and Imputation. 1976.
[9] R. S. Garfinkel, et al. Optimal Imputation of Erroneous Data: Categorical Data, General Edits. Oper. Res., 1986.
[10] Bor-Chung Chen, et al. Set Covering Algorithms in Edit Generation. 1998.
[11] William E. Winkler, et al. Set-Covering and Editing Discrete Data. 1998.