The lack of public access to current, real-world datasets significantly hinders the progress of network research as a scientific pursuit. It is often not possible to robustly validate a proposed mechanism, enhancement, or new service without understanding how it will interact with real networks and real users. Yet obtaining the necessary raw measurement data, in particular packet traces that include payload, can prove exceedingly difficult, and lacking appropriate traces can stall the most promising research. The community at large has made extensive efforts to change the status quo by providing collections of public network traces, but its major push to encourage institutions to release anonymized data has achieved only very limited success: in almost all environments, the risks involved with any release still outweigh the potential benefits. The lack of significant progress in this direction, despite extensive efforts, is a clear indication that the community needs a new approach.

An alternative paradigm for enabling network research is mediated trace analysis: rather than bringing the data to the experimenter, bring the experiment to the data. That is, researchers send their analysis programs to data providers, who run the programs on the researchers' behalf and return the output. The community has used this approach on an ad hoc basis for a number of years, but in that form it fails to scale: providers undertake such mediation only on the basis of a great deal of trust that the requesting researcher is acting in good faith and that data released via the mediation will not pose any privacy risks. If as a community we find that effectively conducting our science requires increasing reliance on mediated trace analysis, then we must address in a systematic fashion the crucial technical hurdle of ensuring that mediated analysis programs do not leak sensitive information from the data they process. The two frameworks previously proposed for preventing such leaks have the significant limitation of requiring researchers to code their analysis programs in terms of pre-approved modules [1] or a specific language [7]. In this paper we propose a powerful alternative approach that can work with nearly arbitrary analysis programs while imposing only modest requirements on researchers and data providers.

The key observation we leverage is that the data provider holds the researcher's program "captive," so to speak: the provider can run it multiple times on different inputs and observe the program's behavior in each case. Having captive programs creates an opportunity for permutation analysis.

As a simple example, suppose a researcher requesting mediated analysis asserts that their program is indifferent to IP addresses, other than that the addresses remain distinct in a one-to-one mapping with end systems, but that in fact the program searches for the presence of a single particular IP address in a packet trace and surreptitiously flags its presence in the output. Conceptually, the provider can detect this leakage as follows. They first feed the program the original trace and capture its output. They then permute the trace, consistently altering its embedded addresses, and diff the resulting output against that from the first run. If the sensitive address indeed appeared in the original trace but not in the permuted trace, then the outputs will necessarily differ (otherwise, the program failed to leak its presence).
If the address did not in fact appear in the original trace, then the outputs may agree (they might not, if the permutation happened to accidentally introduce the address), which one might view as "no harm, no foul."

We term such an approach black-box permutation analysis, since it can secure mediated trace analysis without requiring any visibility into the internals of the researcher's program, and thus without imposing any restrictions on how the researcher must code it. While the above example is appealing in its conceptual simplicity, however, applying such analysis in a secure, systematic fashion requires careful consideration of numerous issues. Our work endeavors to illuminate these issues and develop sound approaches for attending to them. In particular, we develop an analytic framework for permutation analysis and employ it to show how to detect violations of a data provider's privacy policy using only a relatively modest number of black-box permutations. We also discuss how our technique can account for innocuous changes in program output via canonicalization, as sketched below.
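To make the detection procedure concrete, the following Python sketch implements one round of black-box permutation analysis, including a simple output canonicalization. It is a minimal illustration under several assumptions not fixed by the discussion above: the researcher's program is taken to be a command-line tool (here called ./analysis) that reads a textual trace on stdin and writes results to stdout, addresses are assumed to appear as dotted quads in the text, and the timestamp-stripping rule used for canonicalization is a hypothetical placeholder. A real deployment would operate on pcap traces and rewrite addresses within packet headers.

```python
# A minimal sketch of black-box permutation analysis, under the
# assumptions stated above. "./analysis" and the trace/timestamp
# formats are hypothetical placeholders.
import random
import re
import subprocess

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def permute_addresses(trace_text, seed):
    """Consistently remap each distinct IP address to a fresh, distinct
    address, preserving the one-to-one correspondence between addresses
    and end systems that the researcher claims is all their program
    depends on."""
    rng = random.Random(seed)
    mapping = {}
    used = set()

    def remap(match):
        addr = match.group(0)
        if addr not in mapping:
            # Draw fresh addresses until we find an unused one. A fresh
            # address may still collide with one from the original trace,
            # i.e., the permutation can "accidentally introduce" a
            # sensitive address, as discussed above.
            while True:
                fresh = ".".join(str(rng.randrange(1, 255)) for _ in range(4))
                if fresh not in used:
                    used.add(fresh)
                    mapping[addr] = fresh
                    break
        return mapping[addr]

    return IP_RE.sub(remap, trace_text)

def canonicalize(output):
    """Normalize innocuous variation before diffing: strip a
    (hypothetical) leading timestamp field and sort the lines, so that
    re-timing or reordering alone does not register as a difference."""
    stripped = [re.sub(r"^\d+\.\d+\s+", "", line) for line in output.splitlines()]
    return sorted(stripped)

def run_captive(program, trace_text):
    """Feed a trace to the captive program on stdin; capture stdout."""
    result = subprocess.run([program], input=trace_text,
                            capture_output=True, text=True, check=True)
    return result.stdout

def violates_indifference(program, trace_text, trials=10):
    """Run the program on the original trace and on several consistently
    permuted variants; any difference in canonicalized output indicates
    the program is sensitive to the concrete addresses after all."""
    baseline = canonicalize(run_captive(program, trace_text))
    return any(
        canonicalize(run_captive(program, permute_addresses(trace_text, seed)))
        != baseline
        for seed in range(trials)
    )
```

Running several independent permutations, rather than just one, reduces the chance that a leak goes unnoticed because a particular permutation happened to reintroduce the flagged address; the analytic framework developed in the paper is what makes precise how many such runs suffice.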
References

[1] Martin F. Arlitt et al., "SC2D: an alternative to trace anonymization," MineNet '06, 2006.
[2] Jason Lee et al., "The devil and packet trace anonymization," ACM SIGCOMM Computer Communication Review, 2006.
[3] Charles V. Wright et al., "Playing Devil's Advocate: Inferring Sensitive Information from Anonymized Network Traces," NDSS, 2007.
[4] Hari Balakrishnan et al., "Efficient and Robust TCP Stream Normalization," IEEE Symposium on Security and Privacy, 2008.
[5] Jason Lee et al., "A first look at modern enterprise traffic," IMC '05, 2005.
[6] Mostafa H. Ammar et al., "Prefix-preserving IP address anonymization: measurement-based security evaluation and a new cryptography-based scheme," Computer Networks, 2004.
[7] Jelena Mirkovic et al., "Privacy-safe network trace sharing via secure queries," NDA '08, 2008.
[8] David Wetherall et al., "Privacy oracle: a system for finding application leaks with black box differential testing," CCS, 2008.
[9] Michael Backes et al., "Automatic Discovery and Quantification of Information Leaks," IEEE Symposium on Security and Privacy, 2009.