Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery

Big geospatial data are archived and made available through online web discovery and access. However, finding the right data for scientific research and application development remains a challenge. This paper aims to improve data discovery by mining user knowledge from log files. Specifically, it focuses on reconstructing user web sessions, a critical step in extracting usage patterns. Reconstructing user sessions from raw web logs has long been difficult because session identifiers tend to be missing in most data portals. To address this problem, we propose two session identification methods: a time-clustering-based method and a time-referrer-based method. We also present the session reconstruction workflow and discuss how to select appropriate thresholds for the relevant steps in the workflow. The proposed session identification methods and workflow are shown to extract data access patterns that support further analysis of user behavior and improve data discovery through more relevant data ranking, suggestion, and navigation.
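To make the idea of time-based session identification concrete, the following is a minimal sketch of a generic fixed time-gap heuristic. It is not the time-clustering-based or time-referrer-based method proposed in the paper. The function name `split_sessions`, the `(user_id, timestamp, url)` record layout, and the 30-minute inactivity threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(records, gap=timedelta(minutes=30)):
    """Group log records into sessions with a fixed inactivity gap.

    records: iterable of (user_id, timestamp, url) tuples (hypothetical layout).
    gap: assumed inactivity threshold; a new session starts when two
         consecutive requests from one user are further apart than this.
    Returns a list of (user_id, [(timestamp, url), ...]) tuples.
    """
    # Collect the requests of each user (for example, keyed by IP address).
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort(key=lambda h: h[0])  # order requests by time
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > gap:
                # Gap exceeded: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions


# Made-up example records, not data from the paper.
log = [
    ("a", datetime(2020, 1, 1, 10, 0), "/search?q=sst"),
    ("b", datetime(2020, 1, 1, 10, 1), "/dataset/7"),
    ("a", datetime(2020, 1, 1, 10, 5), "/dataset/42"),
    ("a", datetime(2020, 1, 1, 11, 0), "/search?q=wind"),
]
sessions = split_sessions(log)
```

With these example records, user "a" is split into two sessions because the last request arrives 55 minutes after the previous one, while user "b" forms a single session. A fixed threshold is the simplest choice; the paper discusses how thresholds can be selected for the relevant workflow steps.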
