Caradoc: A Pragmatic Approach to PDF Parsing and Validation

PDF has become a de facto standard for exchanging electronic documents, for visualization as well as for printing. However, it has also become a common delivery channel for malware, and previous work has highlighted features that lead to security issues. In our work, we focus on the structure of the format, independently of specific features. By methodically testing PDF readers against hand-crafted files, we show that the interpretation of PDF files at the structural level may cause some form of denial of service, or be ambiguous and lead to rendering inconsistencies among readers. We then propose a pragmatic solution that restricts the syntax to avoid common errors, and we define a formal grammar for this restricted syntax. We explain how data consistency can be validated at a finer-grained level using a dedicated type checker. Finally, we assess this approach on a set of real-world files and show that our proposals are realistic.
