Boosting approximate dictionary-based entity extraction with synonyms

Abstract: Dictionary-based entity extraction is an important task in many data analysis applications, such as academic search, document classification, and code auto-debugging. To improve the effectiveness of extraction, many previous studies have focused on approximate dictionary-based entity extraction, which aims to find all substrings in documents that are similar to predefined entities in a reference entity dictionary. However, these studies consider only syntactic similarity metrics, such as Jaccard similarity and edit distance. In real-world scenarios, syntactically different strings often express the same meaning. Existing approximate entity extraction methods fail to capture this kind of semantic similarity and therefore suffer from low recall. In this paper, we formulate the new problem of approximate dictionary-based entity extraction with synonyms and propose an end-to-end framework, Aeetes, to solve it. We propose a new similarity measure, Asymmetric Rule-based Jaccard (JaccAR), which combines synonym rules with syntactic similarity metrics to capture the semantic similarity expressed by the synonyms. We devise and implement a filter-and-verification strategy to improve overall efficiency. To this end, we propose several pruning techniques to reduce the filtering cost and develop novel strategies to improve verification performance. Experimental results on three real-world datasets demonstrate the superior effectiveness and efficiency of Aeetes.
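The idea of combining synonym rules with Jaccard similarity can be illustrated with a minimal Python sketch. The abstract does not give the formal definition of JaccAR, so the code below rests on stated assumptions: rules are applied only to the dictionary entity (the "asymmetric" part), each rule is used at most once, and the score is the best token-level Jaccard over all derived variants of the entity. The names `derived_entities`, `synonym_jaccard`, and `RULES`, as well as the example rule, are hypothetical and not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Plain token-set Jaccard similarity."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def apply_rule(tokens, lhs, rhs):
    """Replace the first occurrence of lhs in tokens with rhs; None if absent."""
    n = len(lhs)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == lhs:
            return tokens[:i] + rhs + tokens[i + n:]
    return None


def derived_entities(entity, rules):
    """All variants obtained by applying any subset of rules, each at most once.

    Assumption: rules in a subset are applied in sequence to the rewritten
    tokens. This is a simplification, and the enumeration is exponential in
    the number of rules, so it is only suitable for small illustrations.
    """
    variants = {entity}
    for r in range(1, len(rules) + 1):
        for subset in combinations(rules, r):
            current = entity
            for lhs, rhs in subset:
                current = apply_rule(current, lhs, rhs)
                if current is None:
                    break  # a rule in this subset does not apply
            else:
                variants.add(current)
    return variants


def synonym_jaccard(entity, substring, rules):
    """Best Jaccard score between the substring and any derived entity variant."""
    e = tuple(entity.lower().split())
    s = tuple(substring.lower().split())
    return max(jaccard(v, s) for v in derived_entities(e, rules))


# Hypothetical synonym rule: "uw" <-> "university of wisconsin"
RULES = [(("uw",), ("university", "of", "wisconsin"))]

plain = jaccard("uw madison".split(), "university of wisconsin madison".split())
with_rules = synonym_jaccard("UW Madison", "University of Wisconsin Madison", RULES)
print(plain, with_rules)
```

In this example the plain Jaccard score is low because only one token is shared, while the rule-aware score reaches its maximum because one derived variant of the entity matches the substring exactly. A practical system would avoid enumerating every variant for every candidate substring, which is what the filter-and-verification strategy in the abstract is meant to address.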
