Scholarly Digital Libraries as a Platform for Malware Distribution

Researchers from academic institutions and the corporate sector rely heavily on scholarly digital libraries for access to journal articles and conference proceedings. These documents are downloaded primarily as PDF files, which attackers may compromise. The PDF format has many capabilities that have been widely used for malicious operations. Attackers increasingly take advantage of unsuspecting users who open PDF files with little or no concern, mistakenly considering them safe. Researchers also regard scholarly digital libraries as reliable: home to a trusted corpus of papers, untainted by malicious files. For these reasons, scholarly digital libraries are an attractive target for cyber-attacks launched via PDF files. In this study, we present several vulnerabilities and practical distribution attack approaches tailored to scholarly digital libraries. To support our claim regarding the attractiveness of scholarly digital libraries as an attack platform, we evaluated more than two million scholarly papers in the CiteSeerX library that were collected over 8 years and found it to be contaminated with a surprisingly large share (0.3%-2%) of malicious scholarly PDF documents originating from 46 countries worldwide. More than 55% of the malicious papers in CiteSeerX were crawled from IP addresses belonging to US universities, followed by IP addresses in Europe (33.6%). We show how existing scholarly digital libraries can easily be leveraged as a distribution platform, both for targeted attacks and for worldwide distribution. On average, a malicious paper had a high potential impact: it was downloaded 167 times in 5 years by researchers in different countries worldwide. Users in the USA and Asia downloaded the most malicious scholarly papers, 40.15% and 27.9%, respectively. The most-downloaded malicious scholarly document is a malicious version of a popular paper in the computer forensics domain, with 2213 downloads across 108 countries. Finally, we suggest several concrete solutions for mitigating such attacks, including simple deterministic solutions as well as advanced machine learning-based frameworks.
