Towards Grammatical Tagging for the Legal Language of Cybersecurity

Legal language can be understood as the language typically used by those engaged in the legal profession and, as such, it may come in either spoken or written form. Recent legislation on cybersecurity uses written legal language and thus inherits its interpretative complications, which stem from the typical abundance of cases and sub-cases and from the general richness in detail. This paper addresses the challenge of an essential interpretation of the legal language of cybersecurity, namely the extraction of the essential Parts of Speech (POS) from legal documents concerning cybersecurity. We tackle this challenge with a methodology for POS tagging of legal language that leverages state-of-the-art open-source tools for Natural Language Processing (NLP), together with manual analysis to validate the outcomes of the tools. As a result, the methodology is automated and, arguably, general for any legal language after minor tailoring of the preprocessing step. It is demonstrated on the most relevant EU legislation on cybersecurity, namely the NIS 2 directive, producing the first, albeit essential, structured interpretation of this document. Moreover, our findings indicate that tools such as spaCy and ClausIE reach their limits on the legal language of the NIS 2.
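The abstract does not spell out the preprocessing step, so the following is only a minimal sketch of what such tailoring might look like, assuming the input is a list of text lines from a legal act and that enumeration markers such as "1." or "(a)" should be removed before tagging. The function name `preprocess`, the regular expression, and the sample lines are all hypothetical and not taken from the paper or from the NIS 2 text.

```python
import re

# Hypothetical pattern: a leading enumeration marker such as "1." or "(a)".
# The paper does not state its actual preprocessing rules.
ENUM_MARKER = re.compile(r"^\s*(?:\(\w{1,4}\)|\d+\.)\s+")


def preprocess(lines):
    """Strip leading list markers and normalise whitespace.

    Returns one cleaned string per non-empty input line.
    """
    cleaned = []
    for line in lines:
        text = ENUM_MARKER.sub("", line)          # drop "1." / "(a)" prefixes
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        if text:                                  # skip lines left empty
            cleaned.append(text)
    return cleaned


# Invented sample lines in the style of a legal act (not quoted from NIS 2).
sample = [
    "1.   Each entity shall   notify the authority.",
    "(a)  the nature of the incident;",
    "   ",
]
result = preprocess(sample)
```

The cleaned strings could then be passed to a POS tagger such as spaCy. Keeping the marker pattern in a single constant is one way to confine the per-jurisdiction tailoring to a single place.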
