Statistical Analysis of Network Data for Cybersecurity

Every day or so a new virus is released on the internet. Every week companies report server break-ins. Several times a year major attacks on the internet are announced. These attacks are reported to cost billions of dollars in lost productivity, lost data, and lost business. Detecting and modeling these attacks has only recently begun to attract the attention of statisticians.

Transaction data in cybersecurity takes many forms. There are the basic protocols of the internet, which govern everything from email to web browsing, remote logins, chat sessions, and instant messaging. For banking purposes, authentication and encryption are essential to ensure that the person accessing the account is who they say they are, and that no unauthorized person can observe the transaction. On the host, there is the interaction with the operating system, and the requirement that programs and users act in their proper manner and are who they say they are.

The field of cybersecurity is large, and we will touch on only a very small part of it. We will discuss primarily network security, in which we are interested in detecting attacks on a networked computer from a remote host. This will require a brief introduction to the network protocols and the data that we observe. We will provide some examples of attacks from the past and discuss some areas for statistical analysis. Then we will consider the problem of visualizing network data.

For a good discussion of computer security, see Bishop [2003]. Statistical issues of computer security are covered in Marchette [2001]. A good discussion of the issues of user profiling, fraud detection, and the detection of masqueraders can be found in Schonlau et al. [2001] and Bolton and Hand [2002].