Paper Information - Detection and Labeling of Personal Identifiable Information in E-mails

The protection of personal identifiable information (PII) is increasingly demanded by customers and data protection regulation. To safeguard PII, an organization has to find out which incoming communication actually contains it. Only then can PII be labeled, tracked, and protected. E-mails are one of the main means of communication, and they consist of unstructured data that is difficult to classify. We developed an automated detection system for PII in e-mails and connected it to a usage control infrastructure. Our concept is based on previous findings in the area of spam detection. We tested our approach with a data set in a customer service scenario. The evaluation shows that Bayesian classification is a very promising way to detect PII.
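As a rough illustration of the spam-filter-style Bayesian classification the abstract refers to, the following is a minimal sketch of a multinomial naive Bayes text classifier with Laplace smoothing. It is not the authors' implementation: the class name `NaiveBayesPII`, the labels `pii` and `no_pii`, the tokenizer, and the toy training messages are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

LABELS = ("pii", "no_pii")  # hypothetical label names, not from the paper


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesPII:
    """Toy multinomial naive Bayes classifier (illustrative sketch only)."""

    def __init__(self):
        self.token_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one labeled message."""
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokenize(text))

    def log_score(self, text, label):
        """Return the log of prior times likelihood (unnormalized)."""
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_tokens = sum(self.token_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed class prior, so an unseen class never yields log(0).
        score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for token in tokenize(text):
            if token not in vocab:
                continue  # tokens never seen in training carry no evidence
            # Laplace (add-one) smoothing over the known vocabulary.
            count = self.token_counts[label][token]
            score += math.log((count + 1) / (total_tokens + len(vocab)))
        return score

    def classify(self, text):
        """Return the label with the highest score."""
        return max(LABELS, key=lambda label: self.log_score(text, label))


# Invented toy messages loosely modeled on a customer service setting.
clf = NaiveBayesPII()
clf.train("My name is Anna Schmidt and my address is 12 Main Street", "pii")
clf.train("Please update my phone number and date of birth on my account", "pii")
clf.train("When will the new product be available in stores", "no_pii")
clf.train("The website is slow today please fix the login page", "no_pii")

print(clf.classify("please change my address and phone number"))
print(clf.classify("the login page is slow"))
```

In a real deployment the classifier would be trained on a labeled e-mail corpus, and its output label could then be handed to whatever component tracks and protects the data; how the paper's system does this is not specified in the abstract.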

Christoph Bier | Jonas Prior
