Fake Document Generation for Cyber Deception by Manipulating Text Comprehensibility

Advanced cyber attackers can penetrate enterprise networks and steal critical documents containing intellectual property despite access control measures. Cyber deception is one way to protect critical documents after an attacker has penetrated the network. It requires generating and deploying decoys such as fake text documents. The comprehensibility of a fake text document can affect the time and effort an attacker needs to succeed. However, existing cybersecurity research has paid limited attention to the comprehensibility features of text when generating fake documents. This article presents a novel method that generates believable fake text documents by measuring and manipulating the comprehensibility of legitimate text within a genetic algorithm (GA) framework. To measure text comprehensibility, we adopt a set of quantitative measures based on qualitative principles of psycholinguistics and reading comprehension: connectivity, dispersion, and sequentiality. Our user-study analysis indicates that these quantitative measures can approximate the degree of human effort required to comprehend a fake text document compared with a legitimate one. To manipulate text comprehensibility, we develop a multiobjective, multimutation GA that modifies a legitimate document to Pareto-optimally alter its comprehensibility measures and generate hard-to-comprehend, believable fake documents. Our experiments show that the proposed algorithm successfully generates fake documents for a broader class of legitimate documents with varied text characteristics than baselines from previous research. Hence, our method can help improve cyber deception systems by providing more believable yet hard-to-comprehend fake documents to mislead cyber attackers.
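
To illustrate what "Pareto-optimally" means when several comprehensibility measures are altered at once, the following is a minimal sketch of non-dominated filtering, the selection idea at the core of multiobjective GAs. It is not the article's implementation: the function names, the score tuples, and the assumption that every objective is to be maximized are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b on every objective
    and strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical objective vectors for four candidate fake documents.
# Each tuple stands for the amount of change in (connectivity, dispersion,
# sequentiality); the values are made up, not taken from the article.
candidates = [
    (0.2, 0.9, 0.1),
    (0.5, 0.5, 0.5),
    (0.1, 0.8, 0.1),  # dominated by the first candidate
    (0.5, 0.5, 0.4),  # dominated by the second candidate
]

front = pareto_front(candidates)
print(front)
```

A GA built around this idea would keep the candidates on the front as parents for the next generation, so that no single measure is improved at the expense of discarding a trade-off that is better on another.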
