Near-duplicate detection for eRulemaking

U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing regulations. In recent years, agencies have begun to accept comments via both email and Web forms. The transition from paper to electronic comments makes it much easier for individuals to customize "form" letters, which they do, creating "near-duplicate" comments that express the same viewpoint in slightly different languages. This paper explores the use of simple text clustering and retrieval algorithms for identifying near-duplicate public comments. Experiments with public comments about a recent regulation proposed by the Environmental Protection Agency (EPA) demonstrate the effectiveness of the algorithms.

[1]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[2]  Hector Garcia-Molina,et al.  Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[3]  Hector Garcia-Molina,et al.  Duplicate Removal in Information Dissemination , 1998 .

[4]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[5]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[6]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[7]  Hector Garcia-Molina,et al.  Finding near-replicas of documents on the Web , 1999 .

[8]  Justin Zobel,et al.  Methods for Identifying Versioned and Plagiarized Documents , 2003, J. Assoc. Inf. Sci. Technol..

[9]  Michalis Vazirgiannis,et al.  On Clustering Validation Techniques , 2001, Journal of Intelligent Information Systems.

[10]  Stuart W. Shulman An experiment in digital government at the United States National Organic Program , 2003 .

[11]  Hector Garcia-Molina,et al.  Finding Near-Replicas of Documents and Servers on the Web , 1998, WebDB.

[12]  Ophir Frieder,et al.  Collection statistics for fast duplicate document detection , 2002, TOIS.

[13]  Jack G. Conrad,et al.  Constructing a text corpus for inexact duplicate detection , 2004, SIGIR '04.

[14]  Jack G. Conrad,et al.  Online duplicate document detection: signature reliability in a dynamic retrieval environment , 2003, CIKM '03.