Paper information - An Introduction to NLP-based Textual Anonymisation

An Introduction to NLP-based Textual Anonymisation

We introduce the problem of automatic textual anonymisation and present a new publicly available, pseudonymised benchmark corpus of personal email text for the task, dubbed ITAC (Informal Text Anonymisation Corpus). We discuss the method by which the corpus was constructed and consider some important issues related to the evaluation of textual anonymisation systems. We also present some initial baseline results on the new corpus using a state-of-the-art HMM-based tagger.

Ben Medlock
