Automatically Classify Chinese Judgment Documents Utilizing Machine Learning Algorithms

In law, a judgment is a decision by a court that resolves a controversy and determines the rights and liabilities of parties in a legal action or proceeding. In 2013, the China Judgments Online system was officially launched for record keeping and notification; to date, it has recorded over 23 million electronic judgment documents. This large body of documents reflects progress in judicial justice and openness. Document categorization is becoming increasingly important for indexing judgments and for further analysis. However, the volume and rapid growth of these documents make manual categorization almost impossible. In this paper, we propose an approach that automatically classifies Chinese judgment documents using machine learning algorithms, including Naive Bayes (NB), Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM). After word segmentation, each judgment document is represented in a vector space model (VSM) using TF-IDF. To improve performance, we construct a set of judicial stop words. In addition, because TF-IDF generates a high-dimensional feature vector, which leads to extremely high time complexity, we apply three dimensionality reduction methods. Extensive experiments on 6,735 judgment documents demonstrate the effectiveness and high classification performance of the proposed method.
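The representation step described above (stop-word filtering followed by TF-IDF weighting) can be illustrated with a minimal sketch. It assumes the documents are already word-segmented into token lists. The stop-word set, the function name `tfidf_vectors`, and the particular TF-IDF variant (length-normalized term frequency times log(N/df)) are illustrative assumptions, not the stop-word list or weighting reported in the paper.

```python
import math
from collections import Counter

# Hypothetical stop words for illustration only; the paper's
# judicial stop-word list is not reproduced here.
STOP_WORDS = {"的", "了", "本院", "认为"}


def tfidf_vectors(docs, stop_words=STOP_WORDS):
    """Build TF-IDF vectors from pre-segmented documents.

    docs: list of token lists (word segmentation is assumed done).
    Returns (vocab, vectors), where vectors[i][j] is the weight of
    vocab[j] in document i.
    """
    # Drop stop words before any counting.
    filtered = [[t for t in doc if t not in stop_words] for doc in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in filtered:
        df.update(set(doc))

    vocab = sorted(df)
    n_docs = len(filtered)
    # One common IDF variant: log(N / df). Terms present in every
    # document get weight 0.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}

    vectors = []
    for doc in filtered:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors
```

Each resulting vector has one dimension per vocabulary term, which is why the feature space grows quickly on a real corpus and why a dimensionality reduction step is applied before training classifiers such as NB, DT, RF or SVM.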
