Online Topic-Aware Entity Resolution Over Incomplete Data Streams

In many real-world applications, such as data integration, social network analysis, and the Semantic Web, entity resolution (ER) is an important and fundamental problem: it identifies and links records from various data sources that refer to the same real-world entity. While prior works usually consider ER over static and complete data, in practice application data are usually collected in a streaming fashion and often contain missing attributes (due to the inaccuracy of data extraction techniques). Therefore, in this paper, we formulate and tackle a novel problem, topic-aware entity resolution over incomplete data streams (TER-iDS), which imputes incomplete tuples online and detects pairs of topic-related matching entities from incomplete data streams. To tackle the TER-iDS problem effectively and efficiently, we propose an effective imputation strategy, carefully design pruning strategies as well as indexes and synopses, and develop an efficient TER-iDS algorithm via index joins. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of our proposed TER-iDS approach over real data sets.
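To make the problem statement concrete, the following is a minimal toy sketch of the three steps named above (online imputation, topic filtering, and pairwise matching within a sliding window). It is not the TER-iDS algorithm: the nearest-neighbor donor imputation, the Jaccard token similarity, the brute-force window scan (in place of the paper's pruning, indexes, and index joins), and all names (`impute`, `resolve`, `window_size`, `threshold`, the sample records) are assumptions chosen only for illustration.

```python
from collections import deque

def tokens(record, attrs):
    """Token set over the given attributes, skipping missing (None) values."""
    out = set()
    for a in attrs:
        value = record.get(a)
        if value is not None:
            out.update(value.lower().split())
    return out

def jaccard(x, y):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    return len(x & y) / len(x | y) if (x or y) else 0.0

def impute(record, window, attrs):
    """Assumed stand-in strategy: copy each missing attribute from the
    windowed record that is most similar on the known attributes."""
    filled = dict(record)
    known = [a for a in attrs if record.get(a) is not None]
    for a in attrs:
        if filled.get(a) is None:
            donors = [r for _, r in window if r.get(a) is not None]
            if donors:
                best = max(
                    donors,
                    key=lambda r: jaccard(tokens(record, known), tokens(r, known)),
                )
                filled[a] = best[a]
    return filled

def resolve(stream, attrs, topic, window_size=3, threshold=0.5):
    """Scan (source, id, record) triples; report cross-source pairs that are
    both topic-related and similar enough, within a sliding window."""
    window = deque(maxlen=window_size)  # oldest records expire automatically
    matches = []
    for source, rid, record in stream:
        full = impute(record, window, attrs)
        toks = tokens(full, attrs)
        if toks & topic:  # topic filter: at least one topic keyword
            for (other_source, other_id), other in window:
                other_toks = tokens(other, attrs)
                if (other_source != source and other_toks & topic
                        and jaccard(toks, other_toks) >= threshold):
                    matches.append(((other_source, other_id), (source, rid)))
        window.append(((source, rid), full))
    return matches

# Hypothetical sample data: two sources, one record with a missing attribute.
stream = [
    ("A", 1, {"name": "diabetes type 2", "symptom": "frequent thirst"}),
    ("B", 2, {"name": "type 2 diabetes", "symptom": None}),
    ("A", 3, {"name": "seasonal flu", "symptom": "fever cough"}),
]
matches = resolve(stream, ["name", "symptom"], {"diabetes"}, threshold=0.7)
print(matches)
```

In this toy run, the record from source B is matched with record 1 from source A only after its missing `symptom` value is imputed; the third record is skipped because it contains no topic keyword.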
