Detecting outlier sections in US congressional legislation

Reading congressional legislation, also known as bills, is often tedious because bills tend to be long and written in complex language. In IBM Many Bills, an interactive web-based visualization of legislation, users of different backgrounds can browse bills and quickly explore the parts that interest them. One task users face is locating sections that do not seem to fit the overall topic of the bill. In this paper, we present novel techniques, drawing on approaches from information retrieval, to determine which sections within a bill are likely to be outliers. The most promising techniques first detect the most topically relevant parts of a bill by ranking its sections, and then compare these topically relevant parts with the remaining sections of the bill. To compare sections, we use various dissimilarity metrics based on Kullback-Leibler divergence. The results indicate that these techniques are more successful than a classification-based approach. Finally, we analyze how well the dissimilarity metrics discriminate between sections that are strong outliers and those that are 'milder' outliers.
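To make the two-stage idea concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes whitespace tokenization, additively smoothed unigram language models, and plain Kullback-Leibler divergence both for ranking sections against the whole bill and for scoring the remaining sections. The function names, the `top_k` cutoff, and the smoothing constant `alpha` are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a fixed vocabulary.

    alpha is an assumed smoothing constant, not a value from the paper.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank_outliers(sections, top_k=2, alpha=0.1):
    """Return (section_index, score) pairs, most outlying first.

    Stage 1 (assumed ranking): sections whose language model is closest
    to the whole-bill model are treated as the topically relevant core.
    Stage 2: every remaining section is scored by its KL divergence
    from a model built on the core sections.
    """
    tokenized = [s.lower().split() for s in sections]
    vocab = sorted({w for toks in tokenized for w in toks})
    models = [unigram_model(toks, vocab, alpha) for toks in tokenized]

    bill_tokens = [w for toks in tokenized for w in toks]
    bill_model = unigram_model(bill_tokens, vocab, alpha)

    # Stage 1: rank sections by closeness to the whole bill.
    by_relevance = sorted(
        range(len(sections)),
        key=lambda i: kl_divergence(models[i], bill_model),
    )
    core = by_relevance[:top_k]
    core_tokens = [w for i in core for w in tokenized[i]]
    core_model = unigram_model(core_tokens, vocab, alpha)

    # Stage 2: score the remaining sections against the core.
    scores = {
        i: kl_divergence(models[i], core_model)
        for i in by_relevance[top_k:]
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_outliers` on a list of section texts returns the non-core sections ordered from most to least dissimilar. In this sketch, a section that shares little vocabulary with the core receives a high score, while a section on the bill's main topic receives a low one.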
