Statistical Methods for Developing Ratio Edit Tolerances for Economic Data

Key data items collected by the economic programs of the U.S. Bureau of the Census are subjected to ratio edits as a part of the overall data-review process. In a ratio edit, the ratio of two highly correlated items is compared to upper and lower bounds, known as tolerances. Ratios outside the tolerances are edit failures, and one or both of the items in an edit-failing ratio are either imputed or flagged for analyst review. The effectiveness of the ratio edit is therefore dependent on the tolerances. From a subject-matter analyst's perspective, ratio edits are appealing because it is difficult to evaluate the "reasonableness" of a data item's value by itself. By comparing an item to other related values in the questionnaire, the analyst can determine if the response appears valid. For example, the ratio of total annual hours to total employees should be approximately 2,000 (40 hours a work week multiplied by 50 work weeks a year). Ratio edit systems are equally appealing from a mathematical perspective. By augmenting explicitly defined ratio edits with implied ratio edits, a record containing edit failures can be corrected by applying a set covering procedure to the set of edit-failing

The U.S. Census Bureau developed general-purpose ratio edit software for use by the ten sectors of the 1997 Economic Census. This software requires explicit bounds (tolerances) for each ratio edit. We investigated statistical methods of automatically setting tolerance limits, examining three methods: robust estimation (15% trimmed mean and standard deviation); resistant fences (EDA method based on first and third quartiles and interquartile range); and gap analysis (Distance Measurement Algorithm for the Selection of Outliers, D_MASO). We also developed an approach for symmetrizing skewed distributions of ratios using power transformations prior to tolerance development.
We evaluated these methods on two sets of historical data: the 1994 Annual Survey of Manufactures (ASM) and the 1992 Business Census. In both data sets, we achieved success with some variation of resistant fences and recommend that this methodology be used in the absence of subject-matter expertise or known mathematical bounds on a ratio relationship.
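
As an illustration of the resistant fences idea, the following Python sketch sets tolerances at the first quartile minus k times the interquartile range and the third quartile plus k times the interquartile range, then flags ratios outside those bounds as edit failures. It is a minimal sketch under stated assumptions, not the software described above: the function names, the multiplier k = 1.5, the "inclusive" quartile definition, and the sample hours-per-employee ratios are all hypothetical choices made for the example.

```python
from statistics import quantiles


def resistant_fence_tolerances(ratios, k=1.5):
    """Return (lower, upper) tolerances from quartiles and the IQR.

    k = 1.5 is an assumed multiplier; other fence widths are possible.
    """
    # Quartile definitions vary; "inclusive" is one common choice (assumption).
    q1, _, q3 = quantiles(ratios, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def ratio_edit_failures(ratios, lower, upper):
    """Return the ratios that fall outside the tolerances."""
    return [r for r in ratios if not (lower <= r <= upper)]


# Hypothetical annual-hours-per-employee ratios; most are near 2,000.
sample_ratios = [1900, 1950, 2000, 2050, 2100, 2000, 1980, 2020, 400, 5200]

lower, upper = resistant_fence_tolerances(sample_ratios)
failures = ratio_edit_failures(sample_ratios, lower, upper)
print(f"tolerances: [{lower}, {upper}]")
print(f"edit failures: {failures}")
```

Because the quartiles are barely affected by the two extreme ratios, the resulting tolerances stay close to the bulk of the data, which is the property that makes the method resistant. For a strongly skewed distribution of ratios, the same functions could be applied to power-transformed ratios and the bounds transformed back.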