SEER: Auto-Generating Information Extraction Rules from User-Specified Examples

Information Extraction (IE) today is time-consuming and complicated. Machine learning approaches to IE require large labeled datasets that are difficult to create, and they rely on opaque mathematical models that occasionally return unwanted results that cannot be explained. Rule-based approaches produce easy-to-understand IE rules, but writing those rules is still time-consuming and labor-intensive. SEER combines the best of these two approaches: it learns IE rules from a small number of user-specified examples. In this paper, we explain the design behind SEER and present a user study comparing our system against a commercially available tool in which users create IE rules manually. Our results show that SEER helps users complete text extraction tasks both more quickly and more accurately.
