Keyword Search with Real-time Entity Resolution in Relational Databases

Traditional IR-style keyword search methods for relational databases assume clean data and perform no entity resolution (ER). As a result, on dirty datasets, where duplicate tuples carry different identifiers but refer to the same real-world entity, their answers to a query may contain duplicates. In this paper, we propose a method for processing top-N keyword queries with real-time ER. The method builds an index to retrieve the candidate tuples for a keyword query, defines a function to compute the similarity between the query and each candidate tuple, and uses a divide-and-conquer clustering algorithm to deduplicate the query results. Extensive experiments confirm the effectiveness and efficiency of the method on both dirty and (almost) clean datasets.
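The three stages named in the abstract (index lookup, query-tuple similarity, deduplication of results) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's design: tuples are flattened to plain strings, the index is a simple inverted index over tokens, similarity is Jaccard overlap, and a greedy duplicate filter stands in for the divide-and-conquer clustering algorithm, whose details the abstract does not give. The function names and the `dup_threshold` parameter are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (an assumed, simplistic choice)."""
    return set(text.lower().split())


def build_index(tuples):
    """Inverted index: token -> set of tuple ids containing that token."""
    index = defaultdict(set)
    for tid, text in tuples.items():
        for tok in tokenize(text):
            index[tok].add(tid)
    return index


def jaccard(a, b):
    """Jaccard similarity of two token sets (stand-in similarity function)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def top_n_with_er(query, tuples, index, n=2, dup_threshold=0.5):
    """Return up to n tuple ids, skipping likely duplicates of earlier results."""
    q = tokenize(query)

    # Stage 1: candidate tuples are those sharing at least one query token.
    candidates = set()
    for tok in q:
        candidates |= index.get(tok, set())

    # Stage 2: rank candidates by similarity to the query (ties broken by id).
    ranked = sorted(
        candidates,
        key=lambda tid: (-jaccard(q, tokenize(tuples[tid])), tid),
    )

    # Stage 3: greedy deduplication (NOT the paper's clustering algorithm).
    results = []
    for tid in ranked:
        toks = tokenize(tuples[tid])
        if any(jaccard(toks, tokenize(tuples[r])) >= dup_threshold for r in results):
            continue  # treated as the same real-world entity as a kept result
        results.append(tid)
        if len(results) == n:
            break
    return results


# Hypothetical dirty data: tuples 1 and 2 describe the same person.
sample = {
    1: "John Smith database systems",
    2: "Jon Smith database systems",
    3: "Mary Jones database theory",
}
print(top_n_with_er("smith database", sample, build_index(sample), n=2))
```

In this toy run, tuple 2 is dropped as a near-duplicate of tuple 1, so the second slot of the top-2 answer goes to tuple 3 rather than to a repeated entity.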
