Comparative Study of Different Web Mining Algorithms to Discover Knowledge on the Web

Abstract. The World Wide Web (commonly called the Web) is now widely used and has affected almost every facet of our lives. Searching for and retrieving information from the Web has become a challenge because of its expanding size and complexity, and it requires effective and efficient techniques. Web mining tackles this problem by gathering useful information from the Web through its three categories: web structure mining, web content mining and web usage mining. This paper explains the area of web mining, its categories and the algorithms associated with them. The algorithms discussed are PageRank, SimRank, TF-IDF, k-nearest neighbour (kNN), PageGather and CDL4. We then summarize the algorithms in terms of their working, input parameters, complexity, and pros and cons. We also analyze the discussed algorithms over the parameters of relevance, technique and regression analysis.

Keywords: Web mining, Web structure mining, Web content mining, Web usage mining, PageRank, SimRank, TF-IDF, kNN, PageGather, CDL4.
