Adaptive Rule Discovery for Labeling Text Data

Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines, and the emergence of automated feature generation techniques such as deep learning, which typically require large amounts of training data, has further exacerbated the problem. While weak-supervision techniques can circumvent this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel) or require a labeled subset of the data from which to automatically mine rules (e.g., Snuba). Writing rules manually can be tedious and time-consuming, while creating a labeled subset of the data can be costly and even infeasible in imbalanced settings, since a random sample there often contains only a few positive instances. To address these shortcomings, we present Darwin, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, Darwin automatically generates a set of candidate rules for the labeling task at hand and uses the annotator's feedback to adapt them. Darwin is scalable and versatile: it can operate over large text corpora (more than one million sentences) and supports a wide range of labeling functions (any function that can be specified by a context-free grammar). Finally, we demonstrate through a suite of experiments over five real-world datasets that Darwin enables annotators to generate weakly-supervised labels efficiently and at low cost. On average, the rules discovered by Darwin identify 40% more positive instances than Snuba, even when Snuba is provided with 1,000 labeled instances.
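As a rough illustration of the kind of labeling rule the abstract refers to (this is a hypothetical, Snorkel-style sketch, not Darwin's actual rule language, which the paper specifies via a context-free grammar), a weak-supervision rule is simply a function that maps a sentence to a label or abstains:

```python
import re

# Illustrative label constants; weak-supervision frameworks such as
# Snorkel conventionally use -1 to mean "abstain".
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def rule_causal_pattern(sentence: str) -> int:
    """Hypothetical rule: vote POSITIVE if the sentence matches a
    simple causal pattern (e.g., for drug-effect extraction),
    otherwise abstain."""
    if re.search(r"\b(caused?|causes|induced|led to)\b", sentence, re.IGNORECASE):
        return POSITIVE
    return ABSTAIN

sentences = [
    "Aspirin caused severe headaches in two patients.",
    "The study enrolled 120 participants.",
]
labels = [rule_causal_pattern(s) for s in sentences]  # [1, -1]
```

Such noisy, partially overlapping rule votes are what a weak-supervision system aggregates into training labels; Darwin's contribution is discovering and adapting the rules themselves from annotator feedback rather than requiring them to be written by hand.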
