Information leak detection in financial e-mails using mail pattern analysis under partial information

With the advent of e-mail, leakage of sensitive information has become a daunting problem. The volume of mail leaving a company is often so large that manual monitoring is impossible. Automatic screening mostly relies on content scanning, but sometimes the information is so sensitive that even scanning of the mails by a third party is not permitted, which makes detection difficult. Moreover, mails originating from specific organizations are often restricted in their subject and content, suggesting that powerful generic techniques like content scanning may not be needed. We propose that selecting proper input variables relevant to the domain can help in such cases; a simple, straightforward learning scheme can then detect information leaks efficiently using only mail pattern analysis. We applied our technique to real-life mails from financial institutions. By choosing the input variables judiciously, we were able to learn the mail patterns quite well and detect violations efficiently. The preliminary results are encouraging, with an accuracy close to 92%. This technique is now being implemented in a real-life commercial tool.
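
As an illustration only, the idea of learning mail patterns without reading content might be sketched as follows. The abstract does not name the input variables or the learning scheme, so everything here is an assumption: the metadata fields (recipient domain, hour sent, attachment flag), the per-sender frequency profile, the 5% support threshold, and all names (MailMeta, PatternModel, features) are hypothetical and not the authors' method.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MailMeta:
    """Content-free description of one mail (hypothetical input variables)."""
    sender: str
    recipient_domain: str
    hour: int              # hour of day the mail was sent, 0-23
    has_attachment: bool


def features(m: MailMeta):
    """Map a mail to discrete (name, value) features; the choice is an assumption."""
    return (
        ("domain", m.recipient_domain),
        ("off_hours", m.hour < 7 or m.hour >= 20),
        ("attachment", m.has_attachment),
    )


class PatternModel:
    """Per-sender frequency profile learned from mails assumed to be normal."""

    def __init__(self, min_support: float = 0.05):
        self.min_support = min_support      # assumed threshold, not from the paper
        self.counts = defaultdict(Counter)  # sender -> feature -> count
        self.totals = Counter()             # sender -> number of mails seen

    def fit(self, mails):
        for m in mails:
            self.totals[m.sender] += 1
            self.counts[m.sender].update(features(m))
        return self

    def is_suspicious(self, m: MailMeta) -> bool:
        total = self.totals[m.sender]
        if total == 0:
            return True  # no learned pattern for this sender
        seen = self.counts[m.sender]
        # Flag the mail if any feature value is rare in the sender's history.
        return any(seen[f] / total < self.min_support for f in features(m))
```

A model fitted on a sender's historical metadata would flag a mail whose recipient domain, sending time, or attachment use falls outside that sender's usual pattern, without ever inspecting the subject or body.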