Reverse Engineering Input Syntactic Structure from Program Execution and Its Applications

The syntactic structure of program input is essential for a wide range of applications, such as test case generation, software debugging, and network security. However, this information is often unavailable (e.g., most malware programs communicate over secret protocols) or not directly usable by machines (e.g., many programs specify their inputs in plain text or other ad hoc formats). Furthermore, many programs claim to accept inputs in a published format, but their implementations actually support only a subset or a variant of it. Based on the observations that input structure is manifested by the way input symbols are used during execution, and that most programs process their input with top-down or bottom-up grammars, we devise two dynamic analyses, one for each grammar category. Our evaluation on a set of real-world programs shows that our technique is able to precisely reverse engineer input syntactic structure from execution. We apply our technique to hierarchical delta debugging (HDD) and to network protocol reverse engineering. Our technique enables the complete automation of HDD, which originally required programmers to provide input grammars, and improves the runtime performance of HDD. Our client study on network protocol reverse engineering also shows that our technique outperforms existing techniques.
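The observation that input structure shows up in how input symbols are used during execution can be illustrated with a toy example. The sketch below is not the analysis described above: it assumes a hypothetical hand-written top-down parser (`TracingParser`, with an invented grammar `expr := term ('+' term)*`, `term := DIGIT | '(' expr ')'`) and builds the tracing into that parser by hand, whereas a dynamic analysis would observe an existing program. All names here are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: either a parser routine or a single consumed character."""
    label: str
    children: list = field(default_factory=list)


class TracingParser:
    """Hypothetical top-down parser that records which routine consumed
    each input character. The nesting of routine calls yields a tree."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.root = Node("input")
        self.stack = [self.root]  # mirrors the parser's call stack

    def _enter(self, name):
        node = Node(name)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def _leave(self):
        self.stack.pop()

    def _peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def _consume(self):
        if self.pos >= len(self.text):
            raise ValueError("unexpected end of input")
        ch = self.text[self.pos]
        self.pos += 1
        # Attach the character to the routine that is currently executing.
        self.stack[-1].children.append(Node(ch))
        return ch

    def expr(self):
        self._enter("expr")
        self.term()
        while self._peek() == "+":
            self._consume()
            self.term()
        self._leave()

    def term(self):
        self._enter("term")
        if self._peek() == "(":
            self._consume()
            self.expr()
            if self._consume() != ")":
                raise ValueError("expected ')'")
        elif not self._consume().isdigit():
            raise ValueError("expected a digit")
        self._leave()


def leaves(node):
    """Concatenate the input characters covered by a node."""
    if not node.children:
        return node.label
    return "".join(leaves(child) for child in node.children)


def derive_structure(text):
    """Run the toy parser and return the recorded tree."""
    parser = TracingParser(text)
    parser.expr()
    return parser.root


tree = derive_structure("1+(2+3)")
top_expr = tree.children[0]
print([leaves(child) for child in top_expr.children])
```

Each child of the top-level `expr` node covers one contiguous piece of the input, so the parenthesized group ends up as a single subtree. A tree of this shape is what HDD needs in order to remove whole sub-structures at a time instead of individual characters.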
