Learning stateful models for network honeypots

Attacks such as call fraud and identity theft often involve sophisticated stateful attack patterns that, on top of normal communication, try to harm systems on a higher semantic level than usual attack scenarios. To detect these kinds of threats via specially deployed honeypots, at least a minimal understanding of the inherent state machine of a specific service is needed to lure potential attackers and to sustain a communication for a sufficiently large number of steps. To this end, we propose PRISMA, a method for protocol inspection and state machine analysis, which infers a functional state machine and message format of a protocol from network traffic alone. We apply our method to three real-life network traces, ranging from 10,000 to 2 million messages, of both binary and textual protocols. We show that PRISMA is capable of simulating complete and correct sessions based on the learned models. A case study on malware traffic reveals the different states of the execution, rendering PRISMA a valuable tool for malware analysis.
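To make the idea of learning a state machine from traffic and then simulating sessions concrete, here is a minimal sketch. It is not PRISMA's algorithm: it assumes messages have already been reduced to hypothetical type labels (FTP-like command names chosen for illustration) and fits a plain first-order Markov chain over them. The names `learn_transitions` and `simulate` are likewise illustrative.

```python
import random
from collections import Counter, defaultdict

# Artificial markers for session begin and end (assumption of this sketch).
START, END = "<start>", "<end>"

def learn_transitions(sessions):
    """Estimate first-order transition probabilities between message types."""
    counts = defaultdict(Counter)
    for session in sessions:
        path = [START, *session, END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

def simulate(model, rng, max_steps=50):
    """Sample one session by walking the learned chain from START to END."""
    state, out = START, []
    for _ in range(max_steps):
        choices = model[state]
        state = rng.choices(list(choices), weights=list(choices.values()))[0]
        if state == END:
            break
        out.append(state)
    return out

# Hypothetical, pre-labeled sessions; real traces would need message
# clustering or parsing before this step.
sessions = [
    ["USER", "PASS", "LIST", "QUIT"],
    ["USER", "PASS", "RETR", "QUIT"],
    ["USER", "PASS", "QUIT"],
]
model = learn_transitions(sessions)
session = simulate(model, random.Random(0))
print(session)
```

Every sampled session follows only transitions observed in the training data, which is the property a honeypot needs to keep a dialogue going plausibly. Inferring the message types and formats themselves from raw traffic is the harder part and is not covered by this sketch.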
