Multiple factor hierarchical clustering algorithm for large scale web page and search engine clickstream data

Developments in the World Wide Web and advances in digital data collection and storage technologies over the last two decades allow companies and organizations to store and share huge amounts of electronic documents. Organizing, analyzing, and presenting these documents manually is difficult and inefficient. Search engines help users find relevant information by presenting a list of web pages in response to queries. Helping users find the most relevant web pages in vast text collections efficiently remains a major challenge. The purpose of this study is to propose a hierarchical clustering method that combines multiple factors to identify clusters of web pages that can satisfy users’ information needs. The clusters are primarily intended for search and navigation, and potentially for some form of visualization as well. An experiment was conducted on clickstream data from a professional search engine, and the results show that the clustering method is effective and efficient in terms of both objective and subjective measures.
