Detecting Security Breaches in Personal Data Protection with Machine Learning

In the age of big data and the Internet of Things, large volumes of information, such as medical data, commercial data, and government service data, are generated every second. Protecting personal data to reduce the risks of information use has become crucial in these application fields. In this paper, we design a machine learning model that effectively filters out documents containing personal data and alerts the user. Sentences are segmented into words and phrases, which are marked with part-of-speech tags and given different weights according to their part of the sentence. A pre-trained neural network model then uses the selected features to determine whether a sentence contains any personal data. We also compare the accuracy of different neural network and convolutional neural network models. In addition, a GPU was used to speed up training.
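The feature-extraction step described above (segmenting a sentence, tagging parts of speech, and weighting the tags) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy lexicon, the rule-based tagger, the tag set, and the weight values are all hypothetical, and a real system would use a trained tagger and weights chosen from the data. The resulting vector is the kind of input a classifier could consume.

```python
import re

# Hypothetical toy lexicon mapping lowercase tokens to coarse POS tags.
TOY_LEXICON = {
    "name": "NOUN", "phone": "NOUN", "address": "NOUN",
    "is": "VERB", "lives": "VERB",
    "my": "PRON", "his": "PRON", "her": "PRON",
}

# Hypothetical per-tag weights: tags more likely to carry personal data
# (numbers, proper nouns) are weighted higher. Values are illustrative only.
POS_WEIGHTS = {
    "NUM": 2.0, "PROPN": 1.5, "NOUN": 1.0,
    "PRON": 0.5, "VERB": 0.2, "OTHER": 0.1,
}

# Fixed tag order so every sentence maps to a vector of the same layout.
TAG_ORDER = sorted(POS_WEIGHTS)


def tokenize(sentence):
    """Split a sentence into alphabetic words and digit runs."""
    return re.findall(r"[A-Za-z]+|\d+", sentence)


def tag(token, is_first):
    """Assign a coarse POS tag using simple rules (a stand-in for a real tagger)."""
    if token.isdigit():
        return "NUM"
    if token.lower() in TOY_LEXICON:
        return TOY_LEXICON[token.lower()]
    if token[0].isupper() and not is_first:
        return "PROPN"
    return "OTHER"


def weighted_features(sentence):
    """Return one accumulated weight per tag, ordered as in TAG_ORDER."""
    totals = dict.fromkeys(TAG_ORDER, 0.0)
    for i, token in enumerate(tokenize(sentence)):
        pos = tag(token, is_first=(i == 0))
        totals[pos] += POS_WEIGHTS[pos]
    return [totals[t] for t in TAG_ORDER]


if __name__ == "__main__":
    print(TAG_ORDER)
    print(weighted_features("My name is Alice Chen, phone 5551234"))
```

In this sketch each sentence becomes a fixed-length vector with one entry per tag, so sentences of different lengths can be fed to the same model.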