Web documents categorization using fuzzy representation and HAC

Most existing techniques for characterizing Web documents are based on term-frequency analysis. In such models, each document in a given collection is represented by a feature vector in a vector space. However, because Web documents written in HTML are semi-structured by means of tags, traditional techniques that assign term weights solely by frequency of occurrence may not represent the content of such documents satisfactorily. Some recent studies have shown that the fuzzy representation (FR) of WWW information, which is based on the significance of HTML tags, is an effective alternative for characterizing Web documents. In this paper, the FR is used to generate the feature vector for each Web document, and the hierarchical agglomerative clustering (HAC) algorithm is applied to these vectors to investigate the efficiency and effectiveness of this representation for automatically categorizing Web documents with similar contents. The experiments conducted suggest several benefits of this approach.
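The pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The tag weights in `TAG_WEIGHTS`, the per-tag normalization, the cosine similarity measure, and the average-linkage merge rule are all assumptions chosen for illustration, and the helper names (`TagTermCounter`, `feature_vector`, `hac`) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
import math
import re

# Hypothetical tag importance weights (an assumption, not taken from the paper).
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "b": 0.5, "body": 0.2}
DEFAULT_WEIGHT = 0.2


class TagTermCounter(HTMLParser):
    """Counts term occurrences per innermost enclosing HTML tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = defaultdict(lambda: defaultdict(int))  # tag -> term -> count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "body"
        for term in re.findall(r"[a-z]+", data.lower()):
            self.counts[tag][term] += 1


def feature_vector(html):
    """Tag-weighted term significance, scaled to [0, 1] like a membership degree."""
    parser = TagTermCounter()
    parser.feed(html)
    vec = defaultdict(float)
    for tag, terms in parser.counts.items():
        weight = TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        peak = max(terms.values())  # normalize counts within each tag
        for term, n in terms.items():
            vec[term] += weight * n / peak
    top = max(vec.values(), default=1.0)
    return {term: value / top for term, value in vec.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(vectors, num_clusters):
    """Average-linkage HAC: repeatedly merge the two most similar clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


docs = [
    "<html><title>Neural networks</title><body>Training neural networks with backpropagation</body></html>",
    "<html><title>Deep neural networks</title><body>Layers of a neural network</body></html>",
    "<html><title>Pasta recipes</title><body>Cooking pasta with tomato sauce</body></html>",
    "<html><title>Italian recipes</title><body>Tomato sauce and pasta dishes</body></html>",
]
vectors = [feature_vector(d) for d in docs]
clusters = hac(vectors, 2)
print(clusters)  # → [[0, 1], [2, 3]]
```

In this toy run, terms that appear in `<title>` dominate each vector, so the two neural-network pages and the two recipe pages are merged first. A plain term-frequency vector would weight title and body terms equally.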
