Machine Learning Techniques in Web Content Mining: A Comparative Analysis

With the ever-growing amount of information published on Web pages, the World Wide Web (WWW) has become a prolific area of data mining research. The heterogeneous and semi-structured nature of Web data makes automated knowledge discovery challenging. Web Content Mining (WCM) applies data mining techniques to discover knowledge from the contents of Web pages. This study provides a comparative analysis of the Machine Learning (ML) techniques available in the literature for WCM. The analysis uses criteria such as representation techniques, learning methods, datasets used, and the performance of each method. The survey observes that some traditional ML algorithms have been used efficiently on Web data. Finally, the paper concludes by citing some promising issues for further research in this domain.
