Modeling user interests by conceptual clustering

As more information becomes available on the Web, there has been growing interest in effective personalization techniques. Personal agents that provide assistance based on the content of Web documents and on user interests have emerged as a viable solution to this problem. Since these agents rely on knowledge about users contained in user profiles, i.e., models of user preferences and interests gathered by observing user behavior, the capacity to acquire and model user interest categories has become a critical component of personal agent design. User profiles have to summarize categories corresponding to diverse user information interests at different levels of abstraction so that agents can decide on the relevance of new pieces of information. For this purpose, document clustering offers the advantage that no a priori knowledge of categories is needed, so the categorization is completely unsupervised. In this paper we present a document clustering algorithm, named WebDCC (Web Document Conceptual Clustering), that carries out incremental, unsupervised concept learning over Web documents in order to acquire user profiles. Unlike most user profiling approaches, this algorithm offers comprehensible clustering solutions that can be easily interpreted and explored by both users and other agents. By extracting semantics from Web pages, this algorithm also produces intermediate results that can ultimately be integrated into a machine-understandable format such as an ontology. Empirical results from using this algorithm in an intelligent Web search agent show that it can reach high levels of accuracy in suggesting Web pages.
