A NETWORK FLOW DISCLOSURE AVOIDANCE SYSTEM APPLIED TO THE CENSUS OF AGRICULTURE

The U.S. Bureau of the Census, Agriculture Division, is responsible for collecting data on the agricultural sector and for publishing these data without violating confidentiality laws. Collected data contain sensitive data values, commonly referred to as primary suppressions, that, if published directly, could identify an individual or farm operation. Several methods are available to prevent the primary suppressions from being compromised. These disclosure avoidance techniques include rounding, perturbation, and cell suppression, and are outlined in Cox et al. (1986a). Since rounding and perturbation are unsatisfactory for aggregate magnitude data (Cox et al., 1986b), the Economic and Agriculture Divisions have always chosen a cell suppression technique to protect published tabular data. In the publication, a "D" appears in place of the sensitive data value.

However, in most cases the sensitive data values could still be derived from non-sensitive data, because most data items are published in additive tables. Therefore, additional data values must be suppressed. These additional suppressed data values are commonly referred to as complementary suppressions. The objective in applying complementary suppressions is to protect the sensitive data value at minimum cost. Note that this requires assigning a cost of suppression to each data cell. Commonly, the original data value that would have appeared in the publication is assigned as the cost. Minimizing the cost incurred through complementary suppressions produces a publishable table with maximum data utility; that is, the greatest amount of usable data is provided.

In recent years, the Bureau has conducted research on a cell suppression technique that uses network flow methodology. The use of graph theory in disclosure avoidance originates with Cox (1980) and Gusfield (1984). More recently, Cox et al. (1986a) have outlined this methodology.
A more complete history is given in Greenberg (1990). A general outline of the minimum cost network flow problem and related methodology appears in Bazaraa & Jarvis (1977) and Gondran & Minoux (1984).

Prior to the 1978 Census of Agriculture, analysts in the division performed cell suppression by hand using a technique occasionally referred to as the "nearest-smallest method". For an outline of this method, see Zayatz et al. (forthcoming). The cell suppression procedure was first automated for the 1978 Census of Agriculture by programming a portion of the hand procedure. However, a major portion of the complementary suppression procedure was still performed manually. For the 1982 Census of Agriculture, minor revisions were made to the existing automated cell suppression procedure and the remainder of the hand procedure was automated. This was the first time the entire disclosure avoidance procedure was automated. After the 1982 Census of Agriculture, the disclosure procedure was reviewed and recommendations for improvements were made. These improvements were implemented for the 1987 Census of Agriculture. However, since the automated cell suppression procedure was not based upon any statistical or mathematical methodology, it was not always reliable. Oversuppression occurred frequently, which decreased the amount of usable data published. Undersuppression also occurred, which required analyst intervention to fully protect all sensitive data values.

For the 1992 Census of Agriculture, research was conducted on the cell suppression technique using the network flow system of applying complementary suppressions. However, the network flow system used by other divisions of the Bureau could accommodate only a single two-dimensional table. Almost all agricultural data (as well as most of the data in other economic areas) are contained in a system of two-dimensional tables.
In addition, although Business Division and Industry Division have strictly hierarchical data structures, Agriculture Division does not. Systems of three-dimensional tables contribute further to the complexity of agricultural data. Because of these problems, the existing network flow system was not optimal for agricultural data and required customization.

This paper discusses the formulation of the customized network methodology and the limitations encountered with the customized version when applied to agricultural data. In Section 2 we describe the fundamentals of the network flow system of applying complementary suppressions. A system of two-dimensional tables with "appendages" is presented in Section 3. In Section 4 we discuss a heuristic that links networks to accommodate three-dimensional tables with appendages. We present the main limitations in Section 5 and provide concluding remarks in Section 6.
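
As a small illustration of why complementary suppressions are needed in additive tables, the following Python sketch checks which suppressed cells an outsider could recover exactly by subtraction from published values and totals. It is only an illustrative check under stated assumptions, not the network flow procedure developed in this paper: the function names, the table layout (last column holds row totals, last row holds column totals), and the example figures are all hypothetical. The cost function follows the common convention noted above of using the cell value as the cost of suppression.

```python
def derivable_by_subtraction(table, suppressed):
    """Return {(row, col): value} for suppressed cells recoverable by subtraction.

    Assumed layout (hypothetical): `table` is a list of rows, the last entry of
    each row is the row total, and the last row holds the column totals.
    `suppressed` is a set of (row, col) positions published as "D".
    """
    n_rows, n_cols = len(table), len(table[0])
    # Every row and every column is one additive relation: details sum to total.
    relations = [[(r, c) for c in range(n_cols)] for r in range(n_rows)]
    relations += [[(r, c) for r in range(n_rows)] for c in range(n_cols)]

    derived = {}

    def known(cell):
        # Value visible to an outsider: published, or already derived.
        return derived[cell] if cell in derived else table[cell[0]][cell[1]]

    changed = True
    while changed:
        changed = False
        for relation in relations:
            hidden = [cell for cell in relation
                      if cell in suppressed and cell not in derived]
            if len(hidden) != 1:
                continue  # zero hidden: nothing to do; two or more: not solvable here
            target = hidden[0]
            *details, total = relation
            if target == total:
                value = sum(known(d) for d in details)
            else:
                value = known(total) - sum(known(d) for d in details if d != target)
            derived[target] = value
            changed = True
    return derived


def suppression_cost(table, cells):
    """Cost of suppressing `cells`, using each cell's value as its cost."""
    return sum(table[r][c] for r, c in cells)


# Hypothetical 3 x 3 table of details with marginal totals.
TABLE = [
    [20, 50, 30, 100],
    [10, 40, 25, 75],
    [15, 5, 60, 80],
    [45, 95, 115, 255],
]
PRIMARY = {(2, 1)}                        # the sensitive cell
COMPLEMENTARY = {(2, 0), (1, 0), (1, 1)}  # one possible protecting pattern

print(derivable_by_subtraction(TABLE, PRIMARY))                  # {(2, 1): 5}
print(derivable_by_subtraction(TABLE, PRIMARY | {(2, 0)}))       # both recovered
print(derivable_by_subtraction(TABLE, PRIMARY | COMPLEMENTARY))  # {}
print(suppression_cost(TABLE, COMPLEMENTARY))                    # 65
```

With only the primary suppression, the sensitive value follows from its row total. Adding a single complementary cell in the same row does not help, because that cell can be recovered from its column and the row then has one unknown again. Four suppressed cells at the corners of a rectangle leave two unknowns in every affected row and column, so this simple check recovers nothing. The check covers exact recovery by repeated subtraction only; it does not establish that the pattern is the least costly one.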