Legal documents are known for being lengthy. Some categories of legal documents contain duplicated information that does not require our attention, but manually extracting the non-duplicate information takes a considerable amount of effort. We therefore want machine learning algorithms to pick out unusual sentences for us. In this paper, we propose a set of algorithms that filters out duplicate information and returns useful information to the user. We train a learner that marks unusual parts of a legal document for manual scrutiny. Our learning process has two phases. In the first phase, we select legal documents that share common patterns, e.g., software user agreements, to form a knowledge base for the trainer. We then run an LDA [1] model on these documents, which returns a set of topics common across the knowledge base. In the second phase, we take a new legal document as the test sample. We first remove the common topic words from the test document to increase the differences between sentences. We then use Word2Vec [2], [3] to convert the sentences into vectors. After generating the feature space, we run Agglomerative Clustering and Local Outlier Factor (LOF) [4] on the feature vectors to detect unusual sentences in the given document. Finally, we use PCA and t-SNE to visualize the results.
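The outlier-detection step of the second phase can be illustrated with a minimal, standard-library sketch of the Local Outlier Factor score. This is not the authors' implementation: the function names `local_outlier_factor` and `flag_outliers`, the neighborhood size `k=2`, the threshold `1.5`, and the toy 2-D points (standing in for Word2Vec-based sentence vectors) are all assumptions made for illustration.

```python
import math


def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # Pairwise Euclidean distances between all vectors.
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []     # distance from each point to its k-th nearest neighbor
    neighbors = []  # indices within that distance (ties are included)
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in others if d <= kd])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicate points

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for j in neighbors[i]) / len(neighbors[i])
        for i in range(n)
    ]


def flag_outliers(scores, threshold=1.5):
    """Indices whose LOF exceeds a (hypothetical) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    # Toy stand-ins for sentence vectors: four similar sentences, one unusual.
    vectors = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
    scores = local_outlier_factor(vectors, k=2)
    print(flag_outliers(scores))  # only the isolated last vector is flagged
```

In practice the input would be high-dimensional sentence vectors, and a library implementation of LOF would likely be preferable to this quadratic-time sketch; the threshold would also need tuning on real documents.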
[1] Robert R. Sokal, et al. A statistical method for evaluating systematic relationships. 1958.
[2] Hans-Peter Kriegel, et al. LOF: identifying density-based local outliers. SIGMOD '00, 2000.
[3] Geoffrey E. Hinton, et al. Visualizing Data using t-SNE. 2008.
[4] Michael I. Jordan, et al. Latent Dirichlet Allocation. J. Mach. Learn. Res., 2001.
[5] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. 1987.
[6] Manabu Okumura, et al. Towards Multi-paper Summarization Using Reference Information. IJCAI, 1999.
[7] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.
[8] Wenjie Li, et al. Simultaneous Ranking and Clustering of Sentences: A Reinforcement Approach to Multi-Document Summarization. COLING, 2010.
[9] Jaideep Srivastava, et al. Contextual Anomaly Detection in Text Data. Algorithms, 2012.
[10] Dragomir R. Radev, et al. Scientific Paper Summarization Using Citation Summary Networks. COLING, 2008.
[11] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.