How the Accuracy and Confidence of Sensitivity Classification Affects Digital Sensitivity Review

Government documents must be manually reviewed to identify any sensitive information, e.g., confidential information, before they can be publicly archived. However, human-only sensitivity review is not practical for born-digital documents due, for example, to the volume of documents that must be reviewed. In this work, we conduct a user study to evaluate the effectiveness of sensitivity classification for assisting human sensitivity reviewers. We evaluate how the accuracy and confidence levels of sensitivity classification affect the number of documents that are correctly judged as being sensitive (reviewer accuracy) and the time that it takes to sensitivity review a document (reviewing speed). In our within-subject study, the participants review government documents to identify real sensitivities while being assisted by three sensitivity classification treatments, namely None (no classification predictions), Medium (sensitivity predictions from a simulated classifier with a balanced accuracy (BAC) of 0.7), and Perfect (sensitivity predictions from a classifier with an accuracy of 1.0). Our results show that sensitivity classification leads to significant improvements (ANOVA, p < 0.05) in reviewer accuracy in terms of BAC (+37.9% Medium, +60.0% Perfect) and also in terms of F2 (+40.8% Medium, +44.9% Perfect). Moreover, we show that assisting reviewers with sensitivity classification predictions leads to significantly increased (ANOVA, p < 0.05) mean reviewing speeds (+72.2% Medium, +61.6% Perfect). We find that reviewers do not agree with the classifier significantly more often as the classifier's confidence increases. However, reviewing speed is significantly increased when the reviewers agree with the classifier (ANOVA, p < 0.05). Our in-depth analysis shows that, when the reviewers are not assisted with sensitivity predictions, mean reviewing speeds are 40.5% slower for sensitive judgements than for not-sensitive judgements. However, when the reviewers are assisted with sensitivity predictions, the difference in reviewing speeds between sensitive and not-sensitive judgements is reduced by approximately 10 percentage points, from 40.5% to 30.8%. We also find that, for sensitive judgements, sensitivity classification predictions significantly increase mean reviewing speeds by 37.7% when the reviewers agree with the classifier's predictions (t-test, p < 0.05). Overall, our findings demonstrate that sensitivity classification is a viable technology for assisting human reviewers with the sensitivity review of digital documents.
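For reference, the two reviewer accuracy measures reported above, BAC and F2, have standard textbook definitions; the formulation below is not taken from the paper itself, where TP, TN, FP and FN denote the counts of true/false positive (sensitive) and negative (not-sensitive) judgements, and P and R denote precision and recall:

\[
\mathrm{BAC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right), \qquad
F_2 = \frac{5 \cdot P \cdot R}{4P + R}, \quad
P = \frac{TP}{TP + FP}, \; R = \frac{TP}{TP + FN}.
\]

F2 weights recall more heavily than precision, which fits the sensitivity review setting, since failing to identify a sensitive document is typically more costly than flagging a not-sensitive one.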

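The Medium treatment relies on predictions from a simulated classifier with a balanced accuracy of 0.7. The sketch below shows one simple way such a simulation could be produced: each gold-standard label is independently kept correct with probability equal to the target BAC, so the expected true-positive rate and true-negative rate (and hence the expected BAC) both equal 0.7. This is an illustrative assumption, not the study's own simulation procedure, and the function and variable names are hypothetical.

    import random

    def simulate_predictions(gold_labels, target_bac=0.7, seed=0):
        """Simulate sensitivity predictions at a chosen balanced accuracy.

        Assumes the simulated classifier predicts each sensitive document
        correctly with probability `target_bac` (true-positive rate) and each
        not-sensitive document correctly with the same probability
        (true-negative rate), so the expected BAC equals `target_bac`.
        One plausible simulation scheme only; the paper's procedure may differ.
        """
        rng = random.Random(seed)
        predictions = []
        for label in gold_labels:  # label: True = sensitive, False = not sensitive
            if rng.random() < target_bac:
                predictions.append(label)       # correct prediction
            else:
                predictions.append(not label)   # flipped (incorrect) prediction
        return predictions

    # Usage example with hypothetical gold-standard judgements
    gold = [True, False, False, True, False, True, False, False]
    print(simulate_predictions(gold, target_bac=0.7))

A Perfect treatment under the same scheme simply returns the gold-standard labels unchanged (target_bac=1.0), and the None treatment provides no predictions at all.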