Usage-Driven Unified Model for User Profile and Data Source Profile Extraction

This thesis addresses a problem related to usage analysis in information retrieval systems. We exploit the history of search queries as the basis for extracting a profile model. The objective is to characterize the users and the data sources that interact in a system, so as to allow different types of comparison (user-to-user, source-to-source, user-to-source). From our study of existing work on profile models, we concluded that the large majority of contributions are strongly tied to the applications within which they are proposed. As a result, the proposed profile models are not reusable and suffer from several weaknesses: they do not consider the data source, they lack semantic mechanisms, and they do not deal with scalability (in terms of complexity). We therefore propose a generic model of user and data source profiles. The characteristics of this model are the following. First, it is generic, being able to represent both the user and the data source. Second, it allows profiles to be constructed implicitly from histories of search queries. Third, it defines the profile as a set of topics of interest, each topic corresponding to a semantic cluster of keywords extracted by a specific clustering algorithm. Finally, the profile is represented according to the vector space model. The model is composed of several components organized in the form of a framework, and we assessed the complexity of each component. The main components of the framework are:
- a method for disambiguating keyword queries;
- a method for semantically representing search query logs in the form of a taxonomy;
- a clustering algorithm that allows fast and efficient identification of topics of interest as semantic clusters of keywords;
- a method to identify user and data source profiles according to the generic model.
In particular, this framework makes it possible to perform various tasks related to the usage-based structuring of a distributed environment. As an example of application, the framework is used for the discovery of user communities and the categorization of data sources. To validate the proposed framework, we conduct a series of experiments on real logs from the AOL Search engine, which demonstrate the efficiency of the disambiguation method on short queries and show the relation between quality-based clustering and structure-based clustering.
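
As an illustration of how a vector space representation supports the three types of comparison mentioned above, the following minimal sketch stores each profile as a mapping from topics of interest to weights and compares two profiles with cosine similarity. The topic names, the weights, and the choice of cosine similarity are assumptions made for this example only; they are not taken from the thesis.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse topic-weight profiles.

    Each profile is a dict mapping a topic label to a non-negative weight.
    The same function serves user-to-user, source-to-source, and
    user-to-source comparison, because both kinds of profile share one
    representation.
    """
    shared_topics = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in shared_topics)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical profiles: topic labels and weights are illustrative only.
user_profile = {"football": 3.0, "travel": 1.0}
source_profile = {"football": 2.0, "cooking": 2.0}
other_source = {"gardening": 1.0}

print(cosine_similarity(user_profile, source_profile))
print(cosine_similarity(user_profile, other_source))
```

Because users and data sources share the same representation, a single similarity function covers all three comparison types; how topic weights are actually derived from query logs is left open in this sketch.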
