Towards Automatic Comparison of Data Privacy Documents: A Preliminary Experiment on GDPR-like Laws

The General Data Protection Regulation (GDPR) has become a standard law for data protection in many countries. Currently, twelve countries have adopted the regulation and established their own GDPR-like regulations. However, evaluating the differences and similarities among these GDPR-like regulations is time-consuming and requires substantial manual effort from legal experts. Moreover, GDPR-like regulations from different countries are written in their own languages, which makes the task harder because it requires legal experts who know both languages. In this paper, we investigate a simple natural language processing (NLP) approach to tackle the problem. We first extract chunks of information from GDPR-like documents and form structured data from natural language. Next, we use NLP methods to compare the documents and measure their similarity. Finally, we manually label a small set of data to evaluate our approach. The empirical results show that the BERT model with cosine similarity outperforms the other baselines. Our data and code are publicly available.
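The comparison step can be illustrated with a minimal sketch. It is not the paper's implementation: the `embed` function below is a toy bag-of-words stand-in for a BERT sentence encoder, and the function names and example chunks are hypothetical. Only the cosine-similarity matching between chunks of two documents reflects the approach described above.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts.

    A real pipeline would return a dense BERT vector here instead.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty chunk is treated as similar to nothing
    return dot / (norm_u * norm_v)


def best_matches(chunks_a, chunks_b):
    """For each chunk of document A, find the most similar chunk of document B.

    Returns a list of (index_in_b, score) pairs; chunks_b must be non-empty.
    """
    vecs_b = [embed(c) for c in chunks_b]
    matches = []
    for chunk in chunks_a:
        vec_a = embed(chunk)
        scores = [cosine_similarity(vec_a, vec_b) for vec_b in vecs_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Hypothetical chunks standing in for provisions of two GDPR-like laws.
law_a = [
    "the data subject has the right to erasure",
    "the controller shall notify a breach",
]
law_b = [
    "the controller must notify the authority of a breach",
    "the data subject may request erasure of personal data",
]

for i, (j, score) in enumerate(best_matches(law_a, law_b)):
    print(f"A[{i}] best matches B[{j}] with cosine similarity {score:.3f}")
```

Swapping `embed` for a multilingual sentence encoder would keep the matching logic unchanged, which is why the sketch separates embedding from scoring.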
