Assessing similarity matching for possible integration of feature classifications of geospatial data from official and informal sources

One difficulty in integrating geospatial data sets from different sources is variation in feature classification and semantic content of the data. One step towards achieving beneficial semantic interoperability is to assess the semantic similarity among objects that are categorised within data sets. This article focuses on measuring semantic and structural similarities between categories of formal data, such as Ordnance Survey (OS) cartographic data, and volunteered geographic information (VGI), such as that sourced from OpenStreetMap (OSM), with the intention of assessing possible integration. The model involves ‘tokenisation’ to search for common roots of words, and the feature classifications have been modelled as an XML schema labelled rooted tree for hierarchical analysis. The semantic similarity was measured using the WordNet::Similarity package, while the structural similarities between sub-trees of the source and target schemas have also been considered. Along with dictionary and structural matching, the data type of the category itself is a comparison variable. The overall similarity is based on a weighted combination of these three measures. The results reveal that the use of a generic similarity matching system leads to poor agreement between the semantics of OS and OSM data sets. It is concluded that a more rigorous peer-to-peer assessment of VGI data, increasing numbers and transparency of contributors, the initiation of more programs of quality testing and the development of more directed ontologies can improve spatial data integration.
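The weighted combination of the three measures described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Category` class, the token-overlap stand-in for the WordNet::Similarity score, the child-label comparison used for structure, the exact-match data type test and the weights (0.5, 0.3, 0.2) are all assumptions chosen for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Category:
    """Node in a labelled rooted tree of feature classes (hypothetical structure)."""
    label: str
    data_type: str = "string"
    children: list[Category] = field(default_factory=list)


def tokenise(label: str) -> set[str]:
    """Split labels such as 'railway_station' or 'RailwayStation' into lowercase tokens."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)  # break camelCase boundaries
    return {t.lower() for t in re.split(r"[^A-Za-z]+", spaced) if t}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lexical_similarity(a: Category, b: Category) -> float:
    # Assumption: plain token overlap stands in for a WordNet-based score.
    return jaccard(tokenise(a.label), tokenise(b.label))


def structural_similarity(a: Category, b: Category) -> float:
    # Assumption: compare only the tokens of the immediate child labels.
    tokens_a = set().union(*(tokenise(c.label) for c in a.children))
    tokens_b = set().union(*(tokenise(c.label) for c in b.children))
    return jaccard(tokens_a, tokens_b)


def data_type_similarity(a: Category, b: Category) -> float:
    # Assumption: exact match only; a real matcher could grade related types.
    return 1.0 if a.data_type == b.data_type else 0.0


def overall_similarity(a: Category, b: Category,
                       weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of lexical, structural and data type scores (illustrative weights)."""
    w_lex, w_struct, w_type = weights
    return (w_lex * lexical_similarity(a, b)
            + w_struct * structural_similarity(a, b)
            + w_type * data_type_similarity(a, b))


# Hypothetical category pair, not taken from the OS or OSM schemas.
os_cat = Category("RailwayStation", "string", [Category("Platform")])
osm_cat = Category("railway_station", "string",
                   [Category("platform"), Category("entrance")])
score = overall_similarity(os_cat, osm_cat)
```

In this toy pair the labels tokenise identically, the child sets overlap by half and the data types match, so the score is dominated by the lexical term. Swapping `lexical_similarity` for a dictionary-based measure would be needed to relate labels that share no tokens, such as synonyms.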
