An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining

This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies on a focused web crawler, that is, a crawler that searches and processes only those web pages relevant to a particular topic, to download web documents relevant to culture. After downloading the pages, we extract from each document a number of words for each thematic cultural area and filter out the documents with non-cultural content. We then create multidimensional document vectors comprising the most frequent cultural term occurrences. Next, we calculate the dissimilarity between the culture-related document vectors and, for each cultural theme, use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application that analyzes hundreds of web pages spanning different cultural thematic areas.
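The processing steps described above (term extraction, filtering, vectorization, dissimilarity computation, clustering) can be illustrated with a minimal sketch. It is not the article's implementation: the term list `CULTURAL_TERMS`, the sample pages, the use of Euclidean distance as the dissimilarity measure, and the choice of standard fuzzy c-means with caller-supplied initial centers are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical term list for one cultural theme; not taken from the article.
CULTURAL_TERMS = ["museum", "painting", "opera", "theatre"]

def term_vector(text, terms=CULTURAL_TERMS):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[t]) for t in terms]

def dissimilarity(u, v):
    """Euclidean distance; an assumed measure, the article may use another."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fuzzy_c_means(vectors, centers, m=2.0, iters=100):
    """Standard fuzzy c-means with fuzzifier m and given initial centers.

    Returns (memberships, centers); memberships[i][j] is the degree to
    which document i belongs to cluster j, and each row sums to 1.
    """
    centers = [list(c) for c in centers]
    memberships = []
    for _ in range(iters):
        memberships = []
        for x in vectors:
            dists = [dissimilarity(x, c) for c in centers]
            if 0.0 in dists:
                # Document coincides with a center: full membership there.
                zeros = [d == 0.0 for d in dists]
                row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                row = [w / total for w in inv]
            memberships.append(row)
        # Move each center to the membership-weighted mean of the documents.
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            if total > 0:
                centers[j] = [
                    sum(w * x[k] for w, x in zip(weights, vectors)) / total
                    for k in range(len(centers[j]))
                ]
    return memberships, centers

# Made-up sample pages; the last one has no cultural terms and is filtered out.
pages = [
    "The museum opened a new painting gallery; each painting is restored.",
    "A painting exhibition at the city museum.",
    "Tonight the opera returns to the old theatre, an opera in three acts.",
    "Theatre and opera season tickets.",
    "Cheap flights and hotel deals.",
]
vectors = [v for v in map(term_vector, pages) if sum(v) > 0]
memberships, centers = fuzzy_c_means(vectors, centers=[vectors[0], vectors[2]])
for row in memberships:
    print([round(u, 3) for u in row])
```

Unlike a hard partition, each document here receives a membership degree in every cluster, which is the property that distinguishes fuzzy clustering. In this toy run the initial centers are picked by hand; a real system would need an initialization strategy and a convergence test rather than a fixed iteration count.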
