Mining User Queries with Information Extraction Methods and Linked Data

Purpose: Advanced usage of Web Analytics tools makes it possible to capture the content of user queries. Although these queries are a relevant source of information, manually analysing large volumes of them is problematic. This paper demonstrates the potential of information extraction techniques and Linked Data for gaining a better understanding of the nature of user queries in an automated manner. Design/methodology/approach: The paper presents a large-scale case study conducted at the Royal Library of Belgium, based on a data set of 83 854 queries resulting from 29 812 visits over a 12-month period to the historical newspapers platform BelgicaPress. Using information extraction methods, knowledge bases and various authority files, this paper presents the possibilities and limits of identifying what percentage of end users are looking for person and place names. Findings: Based on a quantitative assessment, our method can successfully identify the majority of person and place names in user queries. Due to the specific character of user queries and the nature of the knowledge bases used, a limited number of queries remained too ambiguous to be treated in an automated manner. Originality/value: This paper demonstrates empirically both the possibilities and the limits of gaining more insight from user queries extracted from a Web Analytics tool and analysed with the help of information extraction tools and knowledge bases. The methods and tools used are generalisable and can be reused by other collection holders.
