Entity Resolution Approach of Data Stream Management Systems

Owing to technological advances in the Semantic Web and sensor networks, a large amount of data has been produced in association with open data policies. However, data stream management systems that process stream data have focused on handling large data volumes, with little priority given to data identification, integration, and external linkage. Furthermore, entity resolution has focused mainly on technologies based on static databases. In this study, we design a real-time stream data processing architecture that can integrate heterogeneous streaming input data, perform entity resolution on it, and interlink it with external data. To achieve this goal, we apply a light adapter that integrates heterogeneous data into a standard schema and a blocking technique that reduces the number of comparison candidates. The implemented data adapters show four times higher throughput than open-source data parsers, and the entity resolution results on streaming data show performance similar to that on static data sets. The proposed streaming data entity resolution architecture is expected to form the basis of data integration research that can efficiently integrate various information sources and enrich internal data.
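
To illustrate how blocking can reduce comparison candidates when records arrive one at a time, the following is a minimal sketch and not the architecture described in the study. The blocking key (the first three letters of a name), the `StreamingResolver` class, the 0.85 similarity threshold, and the sample records are all assumptions made for illustration; the abstract does not specify which blocking key or similarity measure is used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    """Hypothetical blocking key: first three letters of the lower-cased name."""
    return record["name"].strip().lower()[:3]


class StreamingResolver:
    """Illustrative resolver that compares each new record only within its block."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold       # assumed similarity cut-off
        self.blocks = defaultdict(list)  # blocking key -> records seen so far
        self.comparisons = 0             # pairwise comparisons performed

    def process(self, record):
        """Return earlier records judged to match, then index the new record."""
        block = self.blocks[blocking_key(record)]
        matches = []
        for candidate in block:
            self.comparisons += 1
            score = SequenceMatcher(
                None, record["name"].lower(), candidate["name"].lower()
            ).ratio()
            if score >= self.threshold:
                matches.append(candidate)
        block.append(record)
        return matches


# Made-up sample stream; ids 3 and 4 are misspelled variants of ids 1 and 2.
stream = [
    {"id": 1, "name": "Seoul Station"},
    {"id": 2, "name": "Busan Harbor"},
    {"id": 3, "name": "Seoul Staton"},
    {"id": 4, "name": "Busan Harbour"},
]

resolver = StreamingResolver()
results = {r["id"]: [m["id"] for m in resolver.process(r)] for r in stream}
print(results)
print("comparisons:", resolver.comparisons)
```

In this toy stream, each arriving record is compared only with earlier records that share its blocking key, so two comparisons are made instead of the six that comparing every record against all earlier records would require. A key this coarse would miss duplicates whose first letters differ, which is the usual trade-off of key-based blocking.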
