Legal document clustering with built-in topic segmentation

Clustering is a useful tool for helping users navigate, summarize, and organize the large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with varying degrees of success. Some unique characteristics of legal content, as well as the nature of the legal domain, present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession remains largely a fee-for-service field, which makes quality (e.g., in terms of both recall and precision) a key differentiator among the services provided. This paper introduces a classification-based recursive soft clustering algorithm with built-in topic segmentation. The algorithm integrates existing legal document metadata, such as topical classifications, document citations, and click-stream data from user behavior databases, into a comprehensive clustering framework. Techniques associated with the algorithm have been applied successfully to very large databases of legal documents, which include judicial opinions, statutes, regulations, administrative materials, and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the proposed algorithm. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the clusters produced by this algorithm is comparable to that of clusters created by domain experts.
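
To make the idea of soft clustering over topic segments concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that segments are simply paragraphs separated by blank lines, that each topic is described by a short hypothetical keyword profile, and that similarity is plain bag-of-words cosine; the names `soft_assign`, `topic_profiles`, and `threshold` are illustrative only. The recursive, classification-based steps and the use of citation and click-stream metadata are not modeled here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter word vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def soft_assign(document, topic_profiles, threshold=0.3):
    """Assign a document to every topic that at least one segment matches.

    Assumption: segments are paragraphs separated by blank lines.
    Returns {topic: best segment score} for topics at or above threshold.
    """
    segments = [s for s in re.split(r"\n\s*\n", document) if s.strip()]
    profiles = {t: bag_of_words(p) for t, p in topic_profiles.items()}
    memberships = {}
    for segment in segments:
        vector = bag_of_words(segment)
        for topic, profile in profiles.items():
            score = cosine(vector, profile)
            if score >= threshold:
                # Keep the strongest evidence seen for this topic.
                memberships[topic] = max(score, memberships.get(topic, 0.0))
    return memberships


# Hypothetical topic profiles and a two-topic document, for illustration.
topic_profiles = {
    "contract": "contract breach damages agreement",
    "custody": "child custody parent visitation",
}
document = (
    "The breach of the contract caused damages.\n\n"
    "The parent sought custody of the child."
)
memberships = soft_assign(document, topic_profiles)
print(sorted(memberships))
```

Because each segment is scored separately, a multi-topical document can belong to more than one cluster, which is the property that distinguishes soft clustering from a hard, single-label assignment.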
