FORGE: A Fake Online Repository Generation Engine for Cyber Deception

Today, major corporations and government organizations must face the reality that they will be hacked by malicious actors. In this paper, we consider how to defend enterprises that have already been successfully hacked, by imposing additional a posteriori costs on the attacker. Our idea is simple: for every real document d, we develop methods to automatically generate a set Fake(d) of fake documents that are very similar to d. An attacker who steals documents must then examine a large number of documents in detail in order to separate the real one from the fakes. Our FORGE system focuses on technical documents (e.g., engineering and design documents) and involves three major innovations. First, we represent the semantic content of documents via multi-layer graphs (MLGs). Second, we propose a novel concept of “meta-centrality” for multi-layer graphs. Third, we show that the problem of generating the set Fake(d) of fakes can be viewed as an optimization problem. We prove that this problem is NP-complete and then develop efficient heuristics to solve it in practice. We ran detailed experiments with a panel of 20 human subjects and show that FORGE generates highly believable fakes.
