Towards a Classifier for Digital Sensitivity Review

Government records must undergo sensitivity review before they can be released to the official government archives, to prevent the release of sensitive information, such as personal information or information that is prejudicial to international relations. As records are typically reviewed and released after a period of decades, sensitivity review practices are still based on paper records. The transition to digital records brings new challenges, e.g. an increased volume of records, that make current practices impractical. In this paper, we describe our current work towards developing a sensitivity review classifier that can identify and prioritise potentially sensitive digital records for review. Using a test collection built from government records with real sensitivities identified by government assessors, we show that considering the entities present in each record can markedly improve upon a text classification baseline.
