BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration

We present BIGGORILLA, an open-source resource for data scientists who need data preparation and integration tools, and the vision underlying the project. We then describe four packages that we contributed to BIGGORILLA: KOKO (an information extraction tool), FLEXMATCHER (a schema matching tool), and MAGELLAN and DEEPMATCHER (two entity matching tools). We hope that as more software packages are added to BIGGORILLA, it will become a one-stop resource for both researchers and industry practitioners, and will enable our community to advance the state of the art at a faster pace.
