Novel pre-processing technique for web log mining by removing global noise and web robots

Today, the internet has become indispensable to daily life, and almost anything can be searched for online. Web pages often contain a large amount of information that does not interest the user because it is not part of the page's main content. Web Usage Mining (WUM) applies data mining, artificial intelligence, and related techniques to web data in order to forecast users' visiting behavior and infer their interests from usage samples. WUM is directly involved in applications such as e-commerce, e-learning, Web analytics, and information retrieval. Web log data is one of the major sources for WUM: it records the links users visited, their browsing patterns, and the time spent on a particular page or link, and this information can be used in applications such as adaptive web sites, modified services, customer summaries, pre-fetching, and the design of attractive web sites. Existing web usage mining approaches have a variety of problems; in particular, existing algorithms are difficult to apply in practice. This paper continues the line of research on Web access log analysis, which aims to analyze the patterns of web site usage and the features of user behavior. Raw log data is characteristically noisy and unclear, so preprocessing is essential for an effective and efficient mining process. Preprocessing comprises three phases: data cleaning, user identification, and pattern discovery and analysis. In this paper, a novel pre-processing technique is proposed that removes local and global noise and web robots. Preprocessing is an important step because the Web architecture is very complex and 80% of the mining process is done at this phase.
The Anonymous Microsoft Web Dataset and the MSNBC.com Anonymous Web Dataset are used to evaluate the proposed preprocessing technique.
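
The abstract does not spell out the cleaning rules, so the following is only a minimal sketch of what the data cleaning phase of log preprocessing might look like, not the technique proposed in the paper. It assumes the input is in Apache Combined Log Format, and every name in it (`clean_log`, `STATIC_SUFFIXES`, `ROBOT_AGENT_HINTS`, the sample lines) is hypothetical. Robots are flagged by two common heuristics: a request for `/robots.txt`, or a user-agent string containing a crawler keyword.

```python
import re

# Assumed input: Apache Combined Log Format (an assumption, not stated in the paper).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical filter lists; a real deployment would tune these.
STATIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider", "slurp")


def parse_line(line):
    """Return a dict of log fields, or None if the line does not parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def clean_log(lines):
    """Drop robot traffic, embedded resources, failed requests and non-GET requests."""
    entries = [e for e in (parse_line(line) for line in lines) if e is not None]

    # Heuristic 1: any host that fetched /robots.txt is treated as a robot.
    robot_hosts = {
        e["host"] for e in entries if e["path"].split("?")[0] == "/robots.txt"
    }

    kept = []
    for e in entries:
        agent = (e["agent"] or "").lower()
        path = e["path"].split("?")[0].lower()
        if e["host"] in robot_hosts:
            continue  # robot identified by its robots.txt request
        if any(hint in agent for hint in ROBOT_AGENT_HINTS):
            continue  # Heuristic 2: robot identified by user-agent keyword
        if path.endswith(STATIC_SUFFIXES):
            continue  # images, stylesheets and scripts are not page views
        if not e["status"].startswith("2"):
            continue  # keep only successful requests
        if e["method"] != "GET":
            continue  # keep only page retrievals
        kept.append(e)
    return kept


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.png HTTP/1.0" 200 512 "/index.html" "Mozilla/5.0"',
    '10.0.0.2 - - [10/Oct/2000:13:56:00 -0700] "GET /robots.txt HTTP/1.0" 200 64 "-" "SomeFetcher/1.0"',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /about.html HTTP/1.0" 200 900 "-" "SomeFetcher/1.0"',
    '10.0.0.3 - - [10/Oct/2000:13:57:00 -0700] "GET /news.html HTTP/1.0" 200 1200 "-" "Googlebot/2.1"',
    '10.0.0.4 - - [10/Oct/2000:13:58:00 -0700] "GET /missing.html HTTP/1.0" 404 0 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Oct/2000:13:59:00 -0700] "GET /products.html?id=3 HTTP/1.0" 200 3100 "/index.html" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for entry in clean_log(SAMPLE_LOG):
        print(entry["host"], entry["path"])
```

Keyword matching on user agents misses robots that disguise themselves as browsers, which is why the sketch also treats a `/robots.txt` request as evidence; behavioral checks such as request rate would be a further refinement.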
