A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications

TAX POLICY CENTER | URBAN INSTITUTE & BROOKINGS INSTITUTION

The Statistics of Income division of the Internal Revenue Service releases an annual public-use file of individual income tax returns that is invaluable to tax analysts in government agencies, nonprofit research organizations, and the private sector. However, the Statistics of Income division has had to take increasingly aggressive measures to protect the data against growing disclosure risks, such as a data intruder matching the anonymized public data with other public information available in nontax databases. This project develops an alternative privacy protection method: a fully synthetic representation of the income tax data that is statistically representative of the original data. The method generates the synthetic data from a smoothed version of the empirical distribution of income tax returns. The resulting synthetic file includes no actual tax return records. In this report, we describe the methods used in the first part of this project, the creation of a synthetic public-use file of nonfilers. We show how the methodology protects the underlying data from disclosure, and we evaluate the quality of the synthetic data.

ABOUT THE TAX POLICY CENTER

The Urban-Brookings Tax Policy Center aims to provide independent analyses of current and longer-term tax issues and to communicate its analyses to the public and to policymakers in a timely and accessible manner. The Center combines top national experts in tax, expenditure, budget policy, and microsimulation modeling to concentrate on areas of tax policy that are critical to future debate.

Copyright © 2020. Tax Policy Center. Permission is granted for reproduction of this file, with attribution to the Urban-Brookings Tax Policy Center.
