A Framework for Maximizing Utility of Sanitized Documents Based on Meta-labeling

Document sanitization, i.e., the process of removing or generalizing sensitive information in order to reduce the security classification of a document, is widely used in information-sharing applications. Traditional document sanitization systems focus on the removal or generalization of certain words and phrases, but do not take into account the utility of the sanitized documents. This leads to a gap between the sanitized documents and the users' requirements. This paper proposes a formal framework and conceptual algorithms for optimal document sanitization based on meta-labeling. Each document is associated with a meta-label, which determines both the security label and the utility of the document. In the sanitization process, the system first computes a new meta-label for the sanitized version and then sanitizes the document through mediators guided by the new meta-label. Algorithms are provided to compute a new meta-label that is proven to satisfy the security requirements and to provide maximal utility with respect to the users' requirements, which are also represented by a meta-label.
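To make the idea of computing a new meta-label concrete, the following is a minimal sketch under strong assumptions, not the paper's algorithms. It assumes a meta-label can be modeled as a mapping from attributes to disclosure levels, that the security label is the most sensitive disclosure in the document, and that utility counts how much of the users' required detail is retained. The names SENSITIVITY, security_label, utility, and best_meta_label, as well as the exhaustive search, are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical model: a meta-label maps each attribute to a disclosure level
# (0 = removed, 1 = generalized, 2 = verbatim). SENSITIVITY[attr][level] is the
# security level that a disclosure at that level would require (assumed values).
SENSITIVITY = {
    "name": [0, 1, 3],
    "location": [0, 1, 2],
    "date": [0, 0, 1],
}


def security_label(meta_label):
    """Security label of a document: its most sensitive disclosure (assumption)."""
    return max(SENSITIVITY[attr][level] for attr, level in meta_label.items())


def utility(meta_label, required):
    """Utility against the users' meta-label; detail beyond what is required adds nothing."""
    return sum(min(level, required.get(attr, 0)) for attr, level in meta_label.items())


def best_meta_label(original, required, clearance):
    """Exhaustively search for a meta-label within the clearance that maximizes utility."""
    attrs = list(original)
    # Sanitization only removes detail, so each level ranges from 0 to the original level.
    ranges = (range(original[a] + 1) for a in attrs)
    candidates = (dict(zip(attrs, levels)) for levels in product(*ranges))
    feasible = [m for m in candidates if security_label(m) <= clearance]
    # max() keeps the first maximal candidate, i.e. the least detailed among ties.
    return max(feasible, key=lambda m: utility(m, required))


original = {"name": 2, "location": 2, "date": 2}   # unsanitized document
required = {"name": 2, "location": 2, "date": 1}   # users' requirements as a meta-label
new_label = best_meta_label(original, required, clearance=1)
print(new_label, security_label(new_label), utility(new_label, required))
```

In this toy model the search space grows exponentially with the number of attributes, so the brute-force enumeration only illustrates the constraint (security label within the target clearance) and the objective (maximal utility). The resulting meta-label would then guide the mediators that actually rewrite the document text.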