Entity resolution in the web of data

This tutorial provides an overview of the key research results in the area of entity resolution that are relevant to addressing the new challenges in entity resolution posed by the Web of data, in which real world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both to increase dataset linking but also to search the Web of data for entities and their relations.

[1]  P. Jaccard Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines , 1901 .

[2]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[3]  A. Tversky Features of Similarity , 1977 .

[4]  Matthew A. Jaro,et al.  Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida , 1989 .

[5]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[6]  Simone Santini,et al.  Similarity Measures , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Daphna Weinshall,et al.  Classification with Nonmetric Distances: Image Retrieval and Class Representation , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[9]  Surajit Chaudhuri,et al.  Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[10]  George V. Moustakides,et al.  A Bayesian decision model for cost optimal record matching , 2003, The VLDB Journal.

[11]  Chen Li,et al.  Efficient record linkage in large data sets , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[12]  Felix Naumann,et al.  Detecting duplicate objects in XML documents , 2004, IQIS '04.

[13]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[14]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[15]  Felix Naumann,et al.  DogmatiX tracks down duplicates in XML , 2005, SIGMOD '05.

[16]  Dmitri V. Kalashnikov,et al.  Domain-independent data cleaning via analysis of entity-relationship graph , 2006, TODS.

[17]  Felix Naumann,et al.  XML Duplicate Detection Using Sorted Neighborhoods , 2006, EDBT.

[18]  Felix Naumann,et al.  Informationsintegration - Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen , 2006 .

[19]  Tomás Skopal,et al.  On Fast Non-metric Similarity Search by Metric Access Methods , 2006, EDBT.

[20]  Felix Naumann,et al.  Detecting Duplicates in Complex XML Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[21]  Amanda Spink,et al.  How are we searching the World Wide Web? A comparison of nine search engine transaction logs , 2006, Inf. Process. Manag..

[22]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[23]  C. Lee Giles,et al.  Adaptive sorted neighborhood methods for efficient record linkage , 2007, JCDL '07.

[24]  Daisy Zhe Wang,et al.  WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[25]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.

[26]  Mark B. Sandler,et al.  Automatic Interlinking of Music Datasets on the Semantic Web , 2008, LDOW.

[27]  Felix Naumann,et al.  Industry-scale duplicate detection , 2008, Proc. VLDB Endow..

[28]  Hanan Samet,et al.  Metric space similarity joins , 2008, TODS.

[29]  Claudia Niederée,et al.  Probabilistic Entity Linkage for Heterogeneous Information Spaces , 2008, CAiSE.

[30]  Alfio Ferrara,et al.  Towards a Benchmark for Instance Matching , 2008, OM.

[31]  Chang-Tien Lu,et al.  Nearest Neighbor Query , 2017, ACM SIGSPATIAL International Workshop on Advances in Geographic Information Systems.

[32]  Molly McClellan Duplicate Medical Records: A Survey of Twin Cities Healthcare Organizations , 2009, AMIA.

[33]  Georgia Koutrika,et al.  Entity resolution with iterative blocking , 2009, SIGMOD Conference.

[34]  Martin Gaedke,et al.  Discovering and Maintaining Links on the Web of Data , 2009, SEMWEB.

[35]  Yi Li,et al.  RiMOM: A Dynamic Multistrategy Ontology Alignment Framework , 2009, IEEE Transactions on Knowledge and Data Engineering.

[36]  Martin Gaedke,et al.  Silk - A Link Discovery Framework for the Web of Data , 2009, LDOW.

[37]  Dongwon Lee,et al.  HARRA: fast iterative hashed record linkage for large-scale data collections , 2010, EDBT '10.

[38]  Pável Calado,et al.  An Overview of XML Duplicate Detection Algorithms , 2010, Soft Computing in XML Data Management.

[39]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[40]  Axel Polleres,et al.  Some entities are more equal than others: statistical methods to consolidate Linked Data , 2010 .

[41]  Peter Fankhauser,et al.  The missing links: discovering hidden same-as links among a billion of triples , 2010, iiWAS.

[42]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[43]  Felix Naumann,et al.  An Introduction to Duplicate Detection , 2010, An Introduction to Duplicate Detection.

[44]  Peter Fankhauser,et al.  From Web Data to Entities and Back , 2010, CAiSE.

[45]  Shuicheng Yan,et al.  Non-Metric Locality-Sensitive Hashing , 2010, AAAI.

[46]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[47]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[48]  Claudia Niederée,et al.  Eliminating the redundancy in blocking-based entity resolution methods , 2011, JCDL '11.

[49]  Andreas Thor,et al.  Multi-pass sorted neighborhood blocking with MapReduce , 2012, Computer Science - Research and Development.

[50]  Alexei A. Efros,et al.  Data-driven visual similarity for cross-domain image matching , 2011, ACM Trans. Graph..

[51]  Peter Fankhauser,et al.  Efficient entity resolution for large heterogeneous information spaces , 2011, WSDM '11.

[52]  Jeff Heflin,et al.  Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach , 2011, SEMWEB.

[53]  Hector Garcia-Molina,et al.  Managing Information Leakage , 2011, CIDR.

[54]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[55]  Nilesh N. Dalvi,et al.  Large-Scale Collective Entity Matching , 2011, Proc. VLDB Endow..

[56]  Guido Moerkotte,et al.  Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[57]  Gerhard Weikum,et al.  Scalable knowledge harvesting with high precision and high recall , 2011, WSDM '11.

[58]  Sören Auer,et al.  LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data , 2011, IJCAI.

[59]  Robert Isele,et al.  Efficient Multidimensional Blocking for Link Discovery without losing Recall , 2011, WebDB.

[60]  Octavian Udrea,et al.  Apples and oranges: a comparison of RDF benchmarks and real RDF datasets , 2011, SIGMOD '11.

[61]  Craig MacDonald,et al.  MapReduce indexing strategies: Studying scalability and efficiency , 2012, Inf. Process. Manag..

[62]  Jürgen Umbrich,et al.  An empirical survey of Linked Data conformance , 2012, J. Web Semant..

[63]  Andrew Borthwick,et al.  Dynamic Record Blocking: Efficient Linking of Massive Databases in MapReduce , 2012 .

[64]  Peter Christen,et al.  Data Matching , 2012, Data-Centric Systems and Applications.

[65]  Gerhard Weikum,et al.  KORE: keyphrase overlap relatedness for entity disambiguation , 2012, CIKM.

[66]  Andreas Thor,et al.  Dedoop: Efficient Deduplication with Hadoop , 2012, Proc. VLDB Endow..

[67]  Gerhard Weikum,et al.  LINDA: distributed web-of-data-scale entity matching , 2012, CIKM.

[68]  Feifei Li,et al.  Efficient parallel kNN joins for large data in MapReduce , 2012, EDBT '12.

[69]  Yan Dong,et al.  A Similarity-Oriented RDF Graph Matching Algorithm for Ranking Linked Data , 2012, 2012 IEEE 12th International Conference on Computer and Information Technology.

[70]  Ashwin Machanavajjhala,et al.  Entity Resolution: Theory, Practice & Open Challenges , 2012, Proc. VLDB Endow..

[71]  Claudia Niederée,et al.  Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data , 2012, WSDM '12.

[72]  Robert Isele,et al.  Learning Expressive Linkage Rules using Genetic Programming , 2012, Proc. VLDB Endow..

[73]  Michael Gamon,et al.  Active objects: actions for entity-centric search , 2012, WWW.

[74]  Oren Etzioni,et al.  Open Language Learning for Information Extraction , 2012, EMNLP.

[75]  Andreas Thor,et al.  Load Balancing for MapReduce-based Entity Resolution , 2011, 2012 IEEE 28th International Conference on Data Engineering.

[76]  Christos Faloutsos,et al.  V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors , 2012, Proc. VLDB Endow..

[77]  Christopher Ré,et al.  Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference , 2012, Int. J. Semantic Web Inf. Syst..

[78]  Aamod Sane,et al.  Fast and accurate incremental entity resolution relative to an entity knowledge base , 2012, CIKM '12.

[79]  Daniel P. Miranker,et al.  An Unsupervised Algorithm for Learning Blocking Schemes , 2013, 2013 IEEE 13th International Conference on Data Mining.

[80]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[81]  Gerhard Weikum,et al.  Knowledge harvesting in the big-data era , 2013, SIGMOD '13.

[82]  Karl Aberer,et al.  TRank: Ranking Entity Types Using the Web of Data , 2013, International Semantic Web Conference.

[83]  Gerhard Weikum,et al.  YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia: Extended Abstract , 2013, IJCAI.

[84]  Robert Isele,et al.  Active learning of expressive linkage rules using genetic programming , 2013, J. Web Semant..

[85]  Stefan Decker,et al.  Linked cancer genome atlas database , 2013, I-SEMANTICS '13.

[86]  Hector Garcia-Molina,et al.  Incremental entity resolution on rules and data , 2014, The VLDB Journal.

[87]  Simone Paolo Ponzetto,et al.  Collaboratively built semi-structured content and Artificial Intelligence: The story so far , 2013, Artif. Intell..

[88]  Hector Garcia-Molina,et al.  Question Selection for Crowd Entity Resolution , 2013, Proc. VLDB Endow..

[89]  Gjergji Kasneci,et al.  SIGMa: simple greedy matching for aligning large knowledge bases , 2012, KDD.

[90]  Jérôme Euzenat,et al.  Ontology Matching: State of the Art and Future Challenges , 2013, IEEE Transactions on Knowledge and Data Engineering.

[91]  Nectarios Koziris,et al.  ~okeanos: Building a Cloud, Cluster by Cluster , 2013, IEEE Internet Computing.

[92]  Yuan Xue,et al.  Scalable load balancing for mapreduce-based record linkage , 2013, 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC).

[93]  Avigdor Gal,et al.  MFIBlocks: An effective blocking algorithm for entity resolution , 2013, Inf. Syst..

[94]  Hector Garcia-Molina,et al.  Pay-As-You-Go Entity Resolution , 2013, IEEE Transactions on Knowledge and Data Engineering.

[95]  Hector Garcia-Molina,et al.  Disinformation techniques for entity resolution , 2013, CIKM.

[96]  Gianluca Demartini,et al.  Large-scale linked data integration using probabilistic reasoning and crowdsourcing , 2013, The VLDB Journal.

[97]  Klaus Berberich,et al.  Mind the gap: large-scale frequent sequence mining , 2013, SIGMOD '13.

[98]  Yannis Tzitzikas,et al.  Scalable entity-based summarization of web search results using MapReduce , 2014, Distributed and Parallel Databases.

[99]  Claudia Niederée,et al.  A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces , 2013, IEEE Transactions on Knowledge and Data Engineering.

[100]  Fabien L. Gandon,et al.  Survey of Linked Data Based Exploration Systems , 2014, IESD@ISWC.

[101]  Eduardo Valle,et al.  Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data , 2014, SISAP.

[102]  Nathalie Pernelle,et al.  Logical Detection of Invalid SameAs Statements in RDF Data , 2014, EKAW.

[103]  Din J. Wasem Mining of Massive Datasets , 2014 .

[104]  Nilesh N. Dalvi,et al.  Crowdsourcing Algorithms for Entity Resolution , 2014, Proc. VLDB Endow..

[105]  Gautam Shroff,et al.  Graph-Parallel Entity Resolution using LSH & IMM , 2014, EDBT/ICDT Workshops.

[106]  C. Bizer,et al.  Integrating product data from websites offering microdata markup , 2014, WWW.

[107]  Heiko Paulheim,et al.  Adoption of the Linked Data Best Practices in Different Topical Domains , 2014, SEMWEB.

[108]  Jiawei Han,et al.  On building entity recommender systems using user click log and freebase knowledge , 2014, WSDM.

[109]  Jens Lehmann,et al.  Test-driven evaluation of linked data quality , 2014, WWW.

[110]  Kuansan Wang,et al.  Entity linking at the tail: sparse signals, unknown entities, and phrase models , 2014, WSDM.

[111]  Wolfgang Nejdl,et al.  Meta-Blocking: Taking Entity Resolutionto the Next Level , 2014, IEEE Transactions on Knowledge and Data Engineering.

[112]  Christian Bizer,et al.  The WebDataCommons Microdata, RDFa and Microformat Dataset Series , 2014, International Semantic Web Conference.

[113]  Wen-Syan Li,et al.  String Similarity Joins: An Experimental Evaluation , 2014, Proc. VLDB Endow..

[114]  Roi Blanco,et al.  From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search , 2015, WWW.

[115]  Vasilis Efthymiou,et al.  Entity Resolution in the Web of Data , 2015, Entity Resolution in the Web of Data.

[116]  Felix Naumann,et al.  Progressive Duplicate Detection , 2015, IEEE Transactions on Knowledge and Data Engineering.

[117]  Lorena Otero-Cerdeira,et al.  Ontology matching: A literature review , 2015, Expert Syst. Appl..

[118]  Heiko Paulheim,et al.  Heuristics for Fixing Common Errors in Deployed schema.org Microdata , 2015, ESWC.

[119]  Hongyuan Zha,et al.  Cross-Modal Similarity Learning via Pairs, Preferences, and Active Supervision , 2015, AAAI.

[120]  George Papastefanatos,et al.  Scaling Entity Resolution to Large, Heterogeneous Data with Enhanced Meta-blocking , 2016, EDBT.

[121]  Sonia Bergamaschi,et al.  BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution , 2016, Proc. VLDB Endow..

[122]  Juan-Zi Li,et al.  RiMOM-IM: A Novel Iterative Framework for Instance Matching , 2016, Journal of Computer Science and Technology.

[123]  George Papastefanatos,et al.  Boosting the Efficiency of Large-Scale Entity Resolution with Enhanced Meta-Blocking , 2016, Big Data Res..

[124]  Yannis Tzitzikas,et al.  Radius-aware approximate blank node matching using signatures , 2016, Knowledge and Information Systems.

[125]  Michael Granitzer,et al.  DoSeR - A Knowledge-Base-Agnostic Framework for Entity Disambiguation Using Semantic Embeddings , 2016, ESWC.