A Method for Identifying Geospatial Data Sharing Websites by Combining Multi-Source Semantic Information and Machine Learning

Geospatial data sharing is an essential requirement for scientific and technological innovation and for economic and social decision-making in the era of big data. With the development of modern information technology, especially Web 2.0, a large number of geospatial data sharing websites (GDSWs) have appeared on the Internet. A GDSW is a point of access to geospatial data that provides a geospatial data inventory. Precisely identifying these websites is the foundation and prerequisite for sharing and utilizing web geospatial data, and it is also the main challenge of data sharing at this stage. GDSW identification can be regarded as a binary website classification problem, which can be solved with popular machine learning methods. However, the websites obtained from the Internet include a large number of blogs, companies, institutions, and other sites. If these search results are used directly as sample data for machine learning, classification precision is greatly reduced. For this reason, this paper proposes a method to precisely identify GDSWs by combining multi-source semantic information and machine learning. First, based on a keyword set, we used the Baidu search engine to find websites that may be related to geospatial data in the open web environment. Then, we used multi-source semantic information on geospatial data content, morphology, sources, and shared websites to calculate a comprehensive similarity, which filtered out the many search results that contained geospatial keywords but were not related to geospatial data. Finally, the filtered geospatial data websites were used as sample data for machine learning, and the GDSWs were identified and evaluated.
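The filtering step can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so everything below is an assumption made for illustration: the term sets, the weights, the threshold, the example pages, and the use of simple term coverage as the per-source similarity are all hypothetical and are not the authors' implementation.

```python
from typing import Dict, Set

# Hypothetical term sets for the four semantic sources named in the text.
SEMANTIC_TERMS: Dict[str, Set[str]] = {
    "content": {"dem", "landuse", "precipitation", "soil"},
    "morphology": {"raster", "vector", "shapefile", "geotiff"},
    "sources": {"satellite", "survey", "sensor", "station"},
    "shared_websites": {"download", "dataset", "catalog", "portal"},
}

# Hypothetical weights for combining the four similarities (they sum to 1).
WEIGHTS: Dict[str, float] = {
    "content": 0.4,
    "morphology": 0.2,
    "sources": 0.2,
    "shared_websites": 0.2,
}


def coverage(tokens: Set[str], terms: Set[str]) -> float:
    """Fraction of a term set that appears among the page tokens."""
    return len(tokens & terms) / len(terms) if terms else 0.0


def comprehensive_similarity(text: str) -> float:
    """Weighted sum of the per-source coverage scores for one page."""
    tokens = set(text.lower().split())
    return sum(
        WEIGHTS[name] * coverage(tokens, terms)
        for name, terms in SEMANTIC_TERMS.items()
    )


def filter_websites(pages: Dict[str, str], threshold: float = 0.2) -> Dict[str, float]:
    """Keep only pages whose comprehensive similarity reaches the threshold."""
    scores = {url: comprehensive_similarity(text) for url, text in pages.items()}
    return {url: score for url, score in scores.items() if score >= threshold}


# Made-up search results: one data portal and one blog that merely mentions "soil".
pages = {
    "data.example.org": "download raster dem dataset from satellite survey",
    "blog.example.com": "my travel blog about a soil festival",
}
kept = filter_websites(pages)
```

In this toy input the blog contains a geospatial keyword but scores low on the other three sources, so it falls below the threshold and is removed, which is the behavior the filtering step is meant to achieve.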
Training sets were extracted from both the original search data and the data filtered by multi-source semantics; each was used to train four machine learning classification algorithms (KNN, LR, RF, and SVM), and the trained models were evaluated on the same test dataset. The results show that: (1) on the original data, RF and SVM achieve higher classification precision than KNN and LR. (2) When the data filtered by multi-source semantic information are used as sample data for machine learning, the precision of all four classification algorithms improves greatly, and SVM achieves the highest precision. (3) To verify the robustness of the method, different initial sample data were selected and classified using the same method; SVM again achieved the highest classification precision among the four algorithms, which indicates that the proposed method is robust and scalable. Therefore, training on data filtered by multi-source semantic information can effectively improve the classification precision of GDSW identification, and SVM gives the best classification performance of the four algorithms. In addition, the method is robust, which is of great significance for promoting and facilitating the sharing and utilization of open geospatial data.
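For reference, the precision metric used in the comparison is the share of websites predicted as GDSW that really are GDSW, TP / (TP + FP). The sketch below computes it for two sets of predictions on one shared test set. The labels and predictions are made-up illustrative values, not results from the paper, and the names `trained_on_original` and `trained_on_filtered` are hypothetical.

```python
from typing import Dict, List, Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Precision for the positive class (1 = GDSW): TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up ground truth for a shared test set (1 = GDSW, 0 = other website).
y_true: List[int] = [1, 1, 1, 0, 0, 0, 1, 0]

# Made-up predictions from the same classifier trained on two different samples.
predictions: Dict[str, List[int]] = {
    "trained_on_original": [1, 1, 0, 1, 1, 0, 1, 0],
    "trained_on_filtered": [1, 1, 1, 0, 1, 0, 1, 0],
}

# Score both models against the same labels so the values are comparable.
scores = {name: precision(y_true, pred) for name, pred in predictions.items()}
```

Evaluating both models on the same test set, as the text describes, is what makes the two precision values directly comparable.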
