Multi-Source Spatial Entity Linkage

Besides traditional cartographic data sources, spatial information can also be derived from location-based sources. Location-based sources offer rich spatial information describing the semantics of locations. However, even though different location-based sources refer to the same physical world, each one has only partial coverage of the spatial entities of interest, describes them with different attributes, and sometimes provides contradicting information. Hence, the problem of finding which pairs of spatial entities refer to the same physical spatial entity demands specific attention. We propose a solution (QuadSky) to the problem of spatial entity linkage across diverse location-based sources. QuadSky starts with a spatial blocking technique (QuadFlex) that inherits its concept and complexity from the quadtree algorithm but improves the splitting technique so that nearby points are not separated. After the spatial entities within the same block are compared, a novel algorithm, referred to as SkyEx, separates the pairs considered a match (positive class) from the rest (negative class) by using Pareto optimality. SkyEx does not require weights on the attributes, a scoring function, or a training set. QuadSky achieves 0.85 precision and 0.85 recall on a manually labeled dataset of 1,500 pairs, and 0.87 precision and 0.6 recall on a semi-manually labeled dataset of 777,452 pairs. Moreover, QuadSky provides the best trade-off between precision and recall and, consequently, the best F-measure compared to the existing baselines.
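
To make the notion of Pareto optimality concrete, the following is a minimal sketch and not the SkyEx algorithm itself. It assumes each candidate pair is described by a tuple of per-attribute similarities in [0, 1] (the attribute choice, the pair names, and the helper names `dominates` and `pareto_front` are hypothetical) and keeps only the pairs that no other pair dominates. No attribute weights or scoring function are needed for this step; how SkyEx uses such fronts to draw the boundary between the positive and negative class is not covered here.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one similarity value per compared attribute,
# e.g. (name similarity, address similarity, category similarity).
Similarities = Tuple[float, ...]


def dominates(a: Similarities, b: Similarities) -> bool:
    """True if a is at least as similar as b on every attribute
    and strictly more similar on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(pairs: Dict[str, Similarities]) -> List[str]:
    """Return the names of the pairs that are not dominated by any other pair."""
    return [
        name
        for name, sims in pairs.items()
        if not any(dominates(other, sims) for other in pairs.values())
    ]


# Illustrative values only; these are not taken from the paper's datasets.
candidate_pairs = {
    "p1": (0.9, 0.8, 0.7),
    "p2": (0.6, 0.9, 0.5),
    "p3": (0.5, 0.5, 0.5),
    "p4": (0.9, 0.8, 0.6),
}

front = pareto_front(candidate_pairs)
print(front)
```

In this toy input, `p3` and `p4` are dominated by `p1`, while `p1` and `p2` each win on at least one attribute and therefore both stay on the front.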
