Webpage Clustering - Taking the Zero Step: a Case Study of an Iranian Website

The growth of websites and their ever-increasing number of pages has not only frustrated visitors but also made the websites harder for their owners to manage and control. In the past few years, data mining (clustering in particular) has helped website owners deal with the complexity of extracting their visitors' preferences and of understanding their own websites. Within this line of literature, this paper makes several contributions. First, given that SOM has been the popular choice of algorithm for page clustering, a comparison between SOM and K-means (another popular clustering algorithm) was performed to show the superiority of SOM in the task of webpage clustering. Second, because the results of clustering tasks, unlike those of classification, cannot be tested directly, this study proposes a mind-set in which, before taking any other action, one goes through a number of steps in order to choose the best set of data. Third, the literature has never raised the question of how suitable each type of data (content, structure and usage) is for the task it is used for. Using data from an Iranian website, a field study and the SOM algorithm, we show that the popular belief about which type of data is appropriate for which task should be open to doubt. We also show that different sets of data can influence the results tremendously in the two chosen tasks, webpage profiling and the extraction of visitors' preferences. Last but not least, apart from observing the influence of different sets of data, both data mining tasks were carried through to the end, and the results are presented in the paper. Additionally, using the results of the second clustering task (the extraction of visitors' preferences), a novel recommendation system is presented. The recommendation system was installed on the website for more than a month, and its influence on the whole website was observed and analysed.
