Recognizing Citations in Public Comments

ABSTRACT Notice-and-comment rulemaking is central to how U.S. federal agencies craft new regulations. E-rulemaking, the process of soliciting and considering public comments that are submitted electronically, poses a challenge for agencies: the large volume of comments received makes it difficult to distill and address the public's most substantive concerns. This work attempts to alleviate this burden by applying existing machine learning techniques to the problem of recognizing citation sentences. A citation in this context is defined as a statement in which the author of a public comment references an external source of factual information that is associated with a specific person or organization. The problem is formulated as binary classification: is a specific person or organization mentioned in a sentence being referenced as an external source of information? We show that our definition of a citation is reproducible by human judges and that citations can be detected using machine learning techniques with some success. Casting this as a machine learning problem requires selecting an appropriate representation of the sentence. Several feature sets are evaluated individually and in combination, and the best results are obtained by combining feature sets. Syntactic features, which characterize the structure of the sentence rather than its content, significantly improve accuracy when combined with other features, but not when used in isolation. Although the prediction error rate is adequate, coverage (recall) could be improved. An error analysis enumerates the short-term and long-term challenges that must be overcome to improve recall.
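As an illustration of what "combining feature sets" can mean for this formulation, the following is a minimal sketch, not the representation used in this work. It builds one feature dictionary for a (sentence, entity) candidate by merging content features with crude structural features. The cue list, the feature names, the window of three tokens, and the single-token entity are all assumptions made for the example; the positional features are only a stand-in for real syntactic features, which would normally come from a parser.

```python
import re

# Hypothetical reporting cues; the abstract does not list the actual features.
REPORTING_CUES = {"according", "reported", "found", "states", "study"}


def lexical_features(tokens):
    """Bag-of-words features: describe the content of the sentence."""
    return {f"lex:{t}": 1.0 for t in tokens}


def structural_features(tokens, entity_index, window=3):
    """Positional stand-ins for syntactic features: describe where the
    entity sits relative to reporting cues, not which words appear."""
    cue_positions = [i for i, t in enumerate(tokens) if t in REPORTING_CUES]
    near_cue = any(abs(i - entity_index) <= window for i in cue_positions)
    return {
        "syn:entity_is_first": float(entity_index == 0),
        f"syn:cue_within_{window}": float(near_cue),
    }


def candidate_features(sentence, entity):
    """Merge both feature sets into one sparse representation.

    Assumes `entity` is a single token that occurs in `sentence`;
    raises ValueError otherwise.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    entity_index = tokens.index(entity.lower())
    features = {}
    features.update(lexical_features(tokens))
    features.update(structural_features(tokens, entity_index))
    return features


if __name__ == "__main__":
    cited = candidate_features(
        "According to the EPA, mercury levels have doubled.", "EPA")
    uncited = candidate_features("The EPA should withdraw this rule.", "EPA")
    print(cited["syn:cue_within_3"], uncited["syn:cue_within_3"])
```

The prefixes `lex:` and `syn:` keep the two feature sets separable, so a classifier can be trained on either set alone or on the merged dictionary, which mirrors the individual-versus-combined comparison described in the abstract.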
