Detection and Labeling of Personal Identifiable Information in E-mails

The protection of personal identifiable information (PII) is increasingly demanded by customers and data protection regulation. To safeguard PII, an organization has to determine which incoming communication actually contains it. Only then can PII be labeled, tracked, and protected. E-mails are one of the main means of communication, but they consist of unstructured data that is difficult to classify. We developed an automated detection system for PII in e-mails and connected it to a usage control infrastructure. Our concept builds on previous findings in the area of spam detection. We tested our approach with a data set in a customer service scenario. The evaluation shows that Bayes classification is a very promising approach to detecting PII.
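
To make the idea of Bayes classification for PII detection concrete, the following is a minimal sketch of a multinomial naive Bayes text classifier of the kind used in spam filtering. It is not the system described above: the class name `NaiveBayesPII`, the labels `pii` and `no_pii`, the tokenizer, the Laplace smoothing, and the four training e-mails are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter

LABELS = ("pii", "no_pii")  # hypothetical label names


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesPII:
    """Illustrative multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        # Count one document and all of its tokens for the given label.
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_scores(self, text):
        # Vocabulary size is needed for add-one (Laplace) smoothing.
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label in LABELS:
            counts = self.word_counts[label]
            total_tokens = sum(counts.values())
            # Smoothed log prior, so an unseen class does not yield log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
            for tok in tokens:
                # Smoothed log likelihood of the token under this label.
                score += math.log((counts[tok] + 1) / (total_tokens + len(vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training e-mails; a real deployment would need a labeled corpus.
clf = NaiveBayesPII()
clf.train("Please update my address to 12 Main Street and my phone number to 555 0100", "pii")
clf.train("My date of birth is 1 May 1980 and my account number is 12345", "pii")
clf.train("When will the new product be available in stores", "no_pii")
clf.train("Thank you for the quick reply the manual was helpful", "no_pii")

print(clf.classify("my phone number and address changed"))
print(clf.classify("when will the product be available"))
```

In such a setup, the predicted label could serve as the tag that a usage control infrastructure attaches to an incoming e-mail. How the actual system tokenizes messages, weights tokens, or passes labels on is not specified in the text above.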
