Query Segmentation and Resource Disambiguation Leveraging Background Knowledge

Accessing the wealth of structured data available on the Data Web is still a key challenge for lay users. Keyword search is the most convenient way for users to access information (e.g., from data repositories). In this paper we introduce a novel approach, based on a hidden Markov model, for determining the correct resources for user-supplied keyword queries. In our approach the user-supplied query is modeled as the observed data and the background knowledge is used for parameter estimation. Instead of learning the parameters from training data, we leverage the semantic relationships between data items to compute the parameter estimates. To maximize accuracy and usability, query segmentation and resource disambiguation are tightly interwoven. First, an initial set of potential segmentations is obtained by leveraging the underlying knowledge base; then the final set of segments is determined after the most likely resource mapping has been computed using a scoring function. While linguistic methods such as named entity recognition, multi-word unit recognition and POS tagging fail on incomplete sentences (e.g., keyword-based queries), we show that our statistical approach is robust with regard to variance in query expression. Our experiments with the hidden Markov model for resource identification in keyword queries yield very promising results.
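To make the modeling idea concrete, the following is a minimal sketch of Viterbi decoding over a hidden Markov model in which query segments are the observations and candidate resources are the hidden states. It is an illustration under stated assumptions, not the paper's implementation: the resource names (`ex:Paris_City`, `ex:capitalOf`, and so on), the function name `viterbi`, and all probability values are hypothetical and hand-picked, whereas the approach described above derives its parameters from the semantic relationships in the knowledge base.

```python
from typing import Dict, List, Optional, Tuple


def viterbi(
    observations: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[float, List[str]]:
    """Return the probability and path of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in s at step t, predecessor)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [{}]
    for s in states:
        best[0][s] = (start_p[s] * emit_p[s].get(observations[0], 0.0), None)

    for t in range(1, len(observations)):
        best.append({})
        for s in states:
            emit = emit_p[s].get(observations[t], 0.0)
            best[t][s] = max(
                ((best[t - 1][p][0] * trans_p[p].get(s, 0.0) * emit, p) for p in states),
                key=lambda pair: pair[0],
            )

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return prob, path


# Hypothetical toy data: two query segments and four candidate resources.
segments = ["paris", "capital"]
resources = ["ex:Paris_City", "ex:Paris_Hilton", "ex:capitalOf", "ex:Capital_Book"]

# Assumed uniform start distribution.
start = {r: 0.25 for r in resources}

# Assumed emission scores: how well a segment matches a resource label.
emit = {
    "ex:Paris_City": {"paris": 0.9},
    "ex:Paris_Hilton": {"paris": 0.8},
    "ex:capitalOf": {"capital": 0.9},
    "ex:Capital_Book": {"capital": 0.7},
}

# Assumed transition scores: stand-ins for semantic relatedness in the graph.
trans = {
    "ex:Paris_City": {"ex:capitalOf": 0.6, "ex:Capital_Book": 0.1},
    "ex:Paris_Hilton": {"ex:capitalOf": 0.05, "ex:Capital_Book": 0.1},
    "ex:capitalOf": {},
    "ex:Capital_Book": {},
}

prob, path = viterbi(segments, resources, start, trans, emit)
print(path)
```

With these assumed numbers, the decoder prefers the city reading of "paris" because it is more strongly connected to the property chosen for "capital"; the ambiguity is resolved by the transition scores rather than by the label match alone.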
