Next steps in near-duplicate detection for eRulemaking

Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increase processing and storage costs but are rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. Identifying exact- and near-duplicate texts, and recognizing the unique text within near-duplicate documents, are important components of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.
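As a rough illustration of the general idea only (not DURIAN itself, whose details are not given in this abstract), the sketch below compares a comment to a form letter using a bag-of-words representation and then lists sentences that appear only in the comment. The function names, the cosine similarity measure, the 0.7 threshold, and the sentence-level matching are all assumptions chosen for the example.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(form_letter, comment, threshold=0.7):
    """Flag a comment as a likely edited copy of the form letter.

    The threshold is an arbitrary, illustrative value (assumption).
    """
    score = cosine_similarity(bag_of_words(form_letter), bag_of_words(comment))
    return score >= threshold


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def unique_sentences(form_letter, comment):
    """Return sentences of the comment that do not occur in the form letter."""
    known = {s.lower() for s in split_sentences(form_letter)}
    return [s for s in split_sentences(comment) if s.lower() not in known]


form = "Please protect our forests. Stop the proposed rule."
edited = (
    "Please protect our forests. Stop the proposed rule. "
    "My family hikes there every summer."
)
unrelated = "I support the new emissions standard."

print(is_near_duplicate(form, edited))     # edited copy of the form letter
print(is_near_duplicate(form, unrelated))  # different comment
print(unique_sentences(form, edited))      # text added by the commenter
```

A sketch like this ignores the document metadata and content structure that the abstract says DURIAN also uses; it only shows why recovering the added sentence matters, since that sentence is where a substantive point could hide.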
