Anti-aliasing on the web

It is increasingly common for users to interact with the web using a number of different aliases. This trend is a double-edged sword. On one hand, it is a fundamental building block in approaches to online privacy. On the other hand, there are economic and social consequences to allowing each user an arbitrary number of free aliases. Thus, there is great interest in understanding the fundamental issues in obscuring the identities behind aliases. However, most work in the area has focused on linking aliases through analysis of lower-level properties of interactions, such as network routes. We show that aliases that actively post text on the web can be linked together through analysis of that text. We study a large number of users posting on bulletin boards, and develop algorithms to anti-alias those users: we can identify, with a high degree of success, when two aliases belong to the same individual. Our results show that such techniques are surprisingly effective, leading us to conclude that guaranteeing privacy among aliases that post actively requires mechanisms that do not yet exist.
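
To make the idea of linking aliases through their text concrete, the following is a minimal sketch and not the algorithm developed in the paper, which the abstract does not specify. It assumes each alias is summarized by the relative frequencies of the words in its posts and that two aliases are compared by cosine similarity. The function names `word_profile`, `cosine_similarity`, and `most_similar_alias` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def word_profile(posts):
    """Relative word frequencies over all posts written under one alias."""
    counts = Counter()
    for post in posts:
        # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", post.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def most_similar_alias(target_posts, candidates):
    """Return the candidate alias whose profile is closest to the target's.

    `candidates` maps an alias name to the list of posts written under it.
    """
    target = word_profile(target_posts)
    scores = {
        alias: cosine_similarity(target, word_profile(posts))
        for alias, posts in candidates.items()
    }
    return max(scores, key=scores.get), scores
```

A caller would pass the posts of an unknown alias together with a dictionary of known aliases and read off the best-scoring candidate. Word frequencies alone are a weak signal, so this sketch only illustrates the general shape of a text-based linking approach.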
