Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain

Word embeddings are a text representation technique that captures syntactic and semantic linguistic patterns by representing each word as an n-dimensional dense vector. In the legal domain, trained word embeddings exist for languages such as English, Polish, and Chinese. However, to the best of our knowledge, there are no embeddings based on Portuguese (Brazilian or European) legal texts. Our research question is therefore: do the specificity and size of the text corpus used to train word embeddings contribute to more successful classification? To answer it, we train word embedding models in the legal domain on corpora with different levels of specificity and size, and then evaluate their impact on text classification. To cover the different levels of specificity, we collect text documents from different courts of the Brazilian Judiciary, in hierarchical order. We use these corpora to train word embedding models (GloVe) and evaluate them by classifying legal proceedings with a deep learning model (CNN). From a context perspective, the results show that text specificity has a higher impact on word embeddings trained on smaller corpora than on those trained on large ones. From a corpus size perspective, the results show that the larger the corpus used to train the embeddings, the better the results. However, this effect diminishes as the corpus grows, up to a point where adding more words to the corpus has little impact on the results.
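As an illustration of how pretrained GloVe vectors might feed the embedding layer of a classifier such as a CNN, the following is a minimal sketch, not the authors' implementation. It assumes the vectors are stored in the standard GloVe text format (one token per line followed by its space-separated components). The function names `load_glove_text` and `build_embedding_matrix`, the sample tokens, and the random initialization range for out-of-vocabulary words are all hypothetical choices made for this example.

```python
import io
import random


def load_glove_text(handle):
    """Parse GloVe text format: a token followed by its vector components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Build one row per vocabulary word, with row 0 reserved for padding.

    Words missing from the pretrained vectors get small random values
    (an assumption for this sketch). Also returns the fraction of the
    vocabulary covered by the pretrained vectors.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim]  # row 0: padding token
    hits = 0
    for word in vocab:
        if word in vectors:
            matrix.append(list(vectors[word]))
            hits += 1
        else:
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    coverage = hits / len(vocab) if vocab else 0.0
    return matrix, coverage


# Tiny in-memory stand-in for a GloVe vectors file (hypothetical values).
sample = io.StringIO("recurso 0.1 0.2 0.3\nagravo 0.4 0.5 0.6\n")
vectors = load_glove_text(sample)
matrix, coverage = build_embedding_matrix(
    ["recurso", "agravo", "habeas"], vectors, dim=3
)
```

The coverage value gives a rough sense of how well a given embedding corpus matches the vocabulary of the classification task, which is one way the effect of corpus specificity could show up before any model is trained.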
