Data Quality for Deep Learning of Judgment Documents: An Empirical Study

Advances in hardware technology have made it possible to obtain high-definition data through highly sophisticated algorithms. Deep learning has emerged and is now widely used in many fields, and the judicial domain is no exception. As the carrier of litigation activities, judgment documents record the process and results of the people’s courts, and their quality directly affects the fairness and credibility of the law. Interpretability is therefore an indispensable dimension for measuring the quality of judgment documents. Unfortunately, because of uncontrollable factors in the process, such as data transmission and storage, the data sets used for training are usually of poor quality. In addition, because the distribution of case data is severely imbalanced, data augmentation is essential to generate data for low-frequency cases. Based on the existing data set and its application scenarios, we explore data quality issues along four dimensions. We then systematically investigate each dimension to determine its impact on the data set. Finally, we compare the four dimensions to identify which one does the most damage to the data set.
