Reverse engineering of data structures from binary

Reverse engineering of data structures involves two aspects: (1) given an application binary, inferring the data structure definitions; and (2) given a memory dump, inferring the data structure instances. These two capabilities have a number of security and forensics applications, including vulnerability discovery, kernel rootkit detection, and memory forensics. In this dissertation, we present an integrated framework for reverse engineering of data structures from binary. The framework has three key components: REWARDS, SigGraph, and DIMSUM. REWARDS is a data structure definition reverse engineering component that automatically uncovers both the syntax and semantics of data structures. SigGraph and DIMSUM are two data structure instance reverse engineering components that recognize data structure instances in a memory dump. In particular, SigGraph systematically generates non-isomorphic signatures for data structures in an OS kernel and enables brute-force scanning of kernel memory to find data structure instances. SigGraph relies on memory mapping information, whereas DIMSUM, which leverages probabilistic inference techniques, can scan memory directly without memory mapping information. We have developed a number of enabling techniques in our framework, including (1) bi-directional (i.e., backward and forward) data flow analysis, (2) signature graph generation and comparison, and (3) belief-propagation-based probabilistic inference. In this dissertation, we demonstrate how these techniques are integrated into our reverse engineering framework. We have obtained the following preliminary experimental results. REWARDS achieved high accuracy in revealing the definitions of data structures accessed during an execution. SigGraph recognized Linux kernel data structure instances with zero false negatives and close-to-zero false positives, and was robust in the presence of malicious pointer manipulations. When memory mapping information was unavailable, DIMSUM achieved higher effectiveness than previous non-probabilistic approaches.