Protecting User Privacy in Remote Conversational Systems: A Privacy-Preserving Framework Based on Text Sanitization

Large Language Models (LLMs) are attracting increasing attention due to their exceptional performance across numerous tasks. As a result, the general public uses them as a powerful tool for boosting productivity, while natural language processing researchers seek to employ them in solving existing and new research problems. Unfortunately, individuals can access such powerful AI systems only through APIs, which entails transmitting raw data to the models' providers and increases the risk of private data leakage. Existing privacy-preserving methods for cloud-deployed language models aim to protect private information in the pre-training dataset or during the model training phase; they do not address the specific challenges posed by the remote-access paradigm of new large-scale language models. This paper introduces a novel task, "User Privacy Protection for Dialogue Models," which aims to safeguard sensitive user information from any possible disclosure while conversing with chatbots. We also present an evaluation scheme for this task, covering metrics for privacy protection, data availability, and resistance to simulation attacks. Moreover, we propose the first framework for this task: privacy protection through text sanitization. Before sending the input to the remote large model, the framework filters out sensitive information using several rounds of text sanitization based on user-defined privacy types. Upon receiving a response from the large model, it automatically restores the private information so that the conversation proceeds smoothly, as if the privacy filter had never intervened. Experiments on real-world datasets demonstrate the efficacy of our privacy-preserving approach against eavesdropping by potential attackers.
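To make the sanitize-then-restore loop concrete, here is a minimal Python sketch of the idea the abstract describes: sensitive spans matching user-defined privacy types are replaced with placeholders before the prompt leaves the client, and the mapping is reversed on the model's reply. Everything here is a hypothetical illustration, not the paper's actual implementation: the names `PrivacyFilter`, `PRIVACY_PATTERNS`, and `send_to_remote_llm` are assumptions, and the paper's framework would use stronger, multi-round detectors per privacy type rather than simple regexes.

```python
# Minimal sketch of client-side sanitization before a remote LLM call.
# Hypothetical names throughout; not the paper's implementation.
import re

# User-defined privacy types mapped to toy detection patterns.
PRIVACY_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

class PrivacyFilter:
    def __init__(self):
        self.mapping = {}   # placeholder -> original sensitive span
        self.counter = 0

    def sanitize(self, text: str) -> str:
        """One sanitization round per user-defined privacy type."""
        for ptype, pattern in PRIVACY_PATTERNS.items():
            def repl(match):
                self.counter += 1
                placeholder = f"[{ptype}_{self.counter}]"
                self.mapping[placeholder] = match.group(0)
                return placeholder
            text = pattern.sub(repl, text)
        return text

    def restore(self, response: str) -> str:
        """Reinsert the original spans into the remote model's reply."""
        for placeholder, original in self.mapping.items():
            response = response.replace(placeholder, original)
        return response

# Usage: only the sanitized text ever leaves the client.
f = PrivacyFilter()
safe_prompt = f.sanitize("Call me at 555-123-4567 or mail a@b.com.")
# reply = send_to_remote_llm(safe_prompt)   # hypothetical API call
reply = f"Sure, I will reach you at {list(f.mapping)[0]}."  # stand-in
print(f.restore(reply))   # placeholder replaced by the real number
```

The key design point the sketch preserves is that the placeholder-to-span mapping never leaves the client, so the remote provider sees only sanitized text while the user sees a fully restored conversation.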
