Snorkel: rapid training data creation with weak supervision

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models $$2.8\times $$ 2.8 × faster and increase predictive performance an average $$45.5\%$$ 45.5 % versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to $$1.8\times $$ 1.8 × speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides $$132\%$$ 132 % average improvements to predictive performance over prior heuristic approaches and comes within an average $$3.60\%$$ 3.60 % of the predictive performance of large hand-curated training sets.

[1]  H. J. Scudder,et al.  Probability of error of some adaptive pattern-recognition machines , 1965, IEEE Trans. Inf. Theory.

[2]  ASHOK K. AGRAWALA,et al.  Learning with a probabilistic teacher , 1970, IEEE Trans. Inf. Theory.

[3]  A. P. Dawid,et al.  Maximum Likelihood Estimation of Observer Error‐Rates Using the EM Algorithm , 1979 .

[4]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[5]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[6]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[7]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[8]  Razvan C. Bunescu,et al.  Learning to Extract Relations from the Web using Minimal Supervision , 2007, ACL.

[9]  Jason Eisner,et al.  Modeling Annotators: A Generative Approach to Learning from Annotator Rationales , 2008, EMNLP.

[10]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[11]  Dan Klein,et al.  Learning from measurements in exponential families , 2009, ICML '09.

[12]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[13]  Daniel Jurafsky,et al.  Distant supervision for relation extraction without labeled data , 2009, ACL.

[14]  David Cohn,et al.  Active Learning , 2010, Encyclopedia of Machine Learning.

[15]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[16]  Gideon S. Mann,et al.  Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data , 2010, J. Mach. Learn. Res..

[17]  Andrew McCallum,et al.  Modeling Relations and Their Mentions without Labeled Text , 2010, ECML/PKDD.

[18]  David E. Irwin,et al.  Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior , 2011, 2011 31st International Conference on Distributed Computing Systems Workshops.

[19]  Benjamin B. Bederson,et al.  Human computation: a survey and taxonomy of a growing field , 2011, CHI.

[20]  Luke S. Zettlemoyer,et al.  Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations , 2011, ACL.

[21]  Kwong-Sak Leung,et al.  A Survey of Crowdsourcing Systems , 2011, 2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing.

[22]  Enrique Alfonseca,et al.  Pattern Learning for Relation Extraction with a Hierarchical Topic Model , 2012, ACL.

[23]  Hiroshi Nakagawa,et al.  Reducing Wrong Labels in Distant Supervision for Relation Extraction , 2012, ACL.

[24]  Bo Zhao,et al.  A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration , 2012, Proc. VLDB Endow..

[25]  Dietrich Klakow,et al.  Combining Generative and Discriminative Model Scores for Distant Supervision , 2013, EMNLP.

[26]  Divesh Srivastava,et al.  Big data integration , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[27]  Thomas C. Wiegers,et al.  A CTD–Pfizer collaboration: manual curation of 88 000 scientific articles text mined for drug–disease and drug–phenotype interactions , 2013, Database J. Biol. Databases Curation.

[28]  Hongwei Li,et al.  Error Rate Analysis of Labeling by Crowdsourcing , 2013 .

[29]  Anirban Dasgupta,et al.  Aggregating crowdsourced binary ratings , 2013, WWW.

[30]  Yuval Kluger,et al.  Ranking and combining multiple predictors without labeled data , 2013, Proceedings of the National Academy of Sciences.

[31]  Divesh Srivastava,et al.  Fusing data with correlations , 2014, SIGMOD Conference.

[32]  Xi Chen,et al.  Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing , 2014, J. Mach. Learn. Res..

[33]  Christopher D. Manning,et al.  Improved Pattern Learning for Bootstrapped Entity Extraction , 2014, CoNLL.

[34]  Aditya G. Parameswaran,et al.  Comprehensive and reliable crowd assessment algorithms , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[35]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[36]  Jure Leskovec,et al.  The mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility , 2015, J. Am. Medical Informatics Assoc..

[37]  Jens Lehmann,et al.  DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia , 2015, Semantic Web.

[38]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Christopher De Sa,et al.  Data Programming: Creating Large Training Sets, Quickly , 2016, NIPS.

[40]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[41]  Christopher De Sa,et al.  DeepDive: Declarative Knowledge Base Construction , 2016, SGMD.

[42]  Peter D. Karp,et al.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases , 2015, Nucleic Acids Res..

[43]  Bo Zhao,et al.  A Survey on Truth Discovery , 2015, SKDD.

[44]  M-Dyaa Albakour,et al.  What do a Million News Articles Look like? , 2016, NewsIR@ECIR.

[45]  Yifan Peng,et al.  Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task , 2016, Database J. Biol. Databases Curation.

[46]  Michael J. Cafarella,et al.  DeepDive , 2017 .

[47]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database: update 2017 , 2016, Nucleic Acids Res..

[48]  Christopher Ré,et al.  Snorkel: Rapid Training Data Creation with Weak Supervision , 2017, Proc. VLDB Endow..

[49]  Stefano Ermon,et al.  Label-Free Supervision of Neural Networks with Physics and Domain Knowledge , 2016, AAAI.

[50]  Daniel L. Rubin,et al.  Inferring Generative Model Structure with Static Analysis , 2017, NIPS.

[51]  Christopher Ré,et al.  Learning the Structure of Generative Models without Labeled Data , 2017, ICML.

[52]  Christopher Ré,et al.  The HoloClean Framework Dataset to be cleaned Denial Constraints External Information t 1 t 4 t 2 t 3 Johnnyo ’ s , 2017 .

[53]  Xiaojin Zhu,et al.  Semi-Supervised Learning , 2010, Encyclopedia of Machine Learning.

[54]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Christopher Ré,et al.  SLiMFast: Guaranteed Results for Data Fusion and Source Reliability , 2015, SIGMOD Conference.

[56]  Snuba , 2018, Proceedings of the VLDB Endowment.

[57]  Christopher Ré,et al.  Snuba: Automating Weak Supervision to Label Training Data , 2018, Proc. VLDB Endow..

[58]  P. Midford,et al.  The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases , 2007, Nucleic Acids Res..

[59]  Christopher Ré,et al.  Training Classifiers with Natural Language Explanations , 2018, ACL.

[60]  Christopher Ré,et al.  Fonduer: Knowledge Base Construction from Richly Formatted Data , 2017, SIGMOD Conference.

[61]  Christopher Ré,et al.  Snorkel MeTaL: Weak Supervision for Multi-Task Learning , 2018, DEEM@SIGMOD.

[62]  Euan A. Ashley,et al.  Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences , 2018, bioRxiv.

[63]  Frederic Sala,et al.  Training Complex Models with Multi-Task Weak Supervision , 2018, AAAI.

[64]  Christopher Ré,et al.  The Role of Massively Multi-Task and Weak Supervision in Software 2.0 , 2019, CIDR.

[65]  Christopher Ré,et al.  Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale , 2018, SIGMOD Conference.