Characterising dataset search - An analysis of search logs and data requests

Large amounts of data are becoming increasingly available online. To benefit from them, we need tools that retrieve the datasets most relevant to one's data needs. Several vocabularies have been developed to describe datasets in order to increase their discoverability, but for data publishers it is costly and cumbersome to annotate datasets using all of them, which raises the question of which properties are most important. In this work we contribute a systematic study of the patterns and specific attributes that data consumers use to search for data, and of how dataset search compares with general web search. We performed a query log analysis based on logs from four national open data portals and conducted a qualitative analysis of user data requests issued to one of them. Search queries issued on data portals differ from those issued to web search engines in their length, topic, and structure. Based on our findings we hypothesise that portals' search functionalities are currently used in an exploratory manner, rather than to retrieve a specific resource. In our study of data requests we found that geospatial and temporal attributes, as well as information on the required granularity of the data, are the most common features. The findings of both analyses suggest that these features are of higher importance in dataset retrieval than in general web search, and that dataset publishers should therefore focus their efforts on generating dataset descriptions that include them.
