Cybersecurity Research Datasets: Taxonomy and Empirical Analysis

We inspect 965 cybersecurity research papers published between 2012 and 2016 to better understand how datasets are used, produced, and shared. We construct a taxonomy of the types of data created and shared, informed and validated by the examined papers. We then analyze the data gathered on these datasets. Three quarters of the existing datasets used as input to research are publicly available, but fewer than one fifth of the datasets created by researchers are publicly shared. Using a series of linear regressions, we show that researchers who make the datasets they create public are rewarded with more citations to the associated papers. We conclude that researchers have an under-appreciated incentive to share the datasets they create with the broader research community.
