Why Big Data Industrial Systems Need Rules and What We Can Do About It

Big Data industrial systems that address problems such as classification, information extraction, and entity matching very commonly use hand-crafted rules. Today, however, little is understood about the usage of such rules. In this paper we explore this issue. We discuss how these systems differ from those considered in academia. We describe default solutions, their limitations, and reasons for using rules. We show examples of extensive rule usage in industry. Contrary to popular perceptions, we show that there is a rich set of research challenges in rule generation, evaluation, execution, optimization, and maintenance. We discuss ongoing work at WalmartLabs and UW-Madison that illustrates these challenges. Our main conclusions are that (1) using rules (together with techniques such as learning and crowdsourcing) is fundamental to building semantics-intensive Big Data systems, and (2) it is increasingly critical to address rule management, given the tens of thousands of rules that industrial systems often manage today in an ad hoc fashion.
