Web document classification based on fuzzy association

This paper proposes a method for automatically classifying web documents into a set of categories using the concept of fuzzy association. Using the same word or vocabulary to describe different entities creates ambiguity, especially in the web environment, where the user population is large. To address this problem, fuzzy association is used to capture the relationships among the index terms or keywords in the documents: each pair of words is assigned an association value that distinguishes it from the other pairs. In this way, ambiguity in word usage is avoided. Experiments are conducted on data sets collected from two web portals, Yahoo! and the Open Directory Project. We compare our approach with the vector space model using the cosine coefficient. The results show that our approach yields higher accuracy than the vector space model.
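As an illustration of the two ingredients named in the abstract, the following minimal sketch computes a term-to-term fuzzy association matrix and the cosine coefficient used by the vector space baseline. The abstract does not state the formulas, so the specific choices here are assumptions: the association value is taken to be a fuzzy Jaccard ratio over term memberships (sum of minima divided by sum of maxima), and the document expansion uses max-min composition. The function names `fuzzy_association`, `expand`, and `cosine` are hypothetical and not taken from the paper.

```python
from math import sqrt


def cosine(u, v):
    """Cosine coefficient between two term-weight vectors (vector space baseline)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuzzy_association(doc_term):
    """Term-to-term association matrix from document-term memberships in [0, 1].

    ASSUMPTION: r[i][j] = sum_k min(d_ki, d_kj) / sum_k max(d_ki, d_kj),
    a fuzzy Jaccard ratio; the paper's exact definition may differ.
    """
    n_terms = len(doc_term[0])
    r = [[0.0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            num = sum(min(row[i], row[j]) for row in doc_term)
            den = sum(max(row[i], row[j]) for row in doc_term)
            r[i][j] = num / den if den else 0.0
    return r


def expand(doc, r):
    """Expand a document vector through the association matrix.

    ASSUMPTION: max-min composition, expanded[j] = max_i min(doc[i], r[i][j]).
    """
    n_terms = len(doc)
    return [max(min(doc[i], r[i][j]) for i in range(n_terms))
            for j in range(n_terms)]


# Toy collection over the terms ["java", "coffee", "code"] (made-up data).
docs = [
    [1.0, 1.0, 0.0],  # "java" used together with "coffee"
    [1.0, 0.0, 1.0],  # "java" used together with "code"
    [0.0, 0.0, 1.0],  # "code" only
]
assoc = fuzzy_association(docs)
coffee_only = expand([0.0, 1.0, 0.0], assoc)
```

In this toy collection, a document that mentions only "coffee" acquires a partial membership for "java" after expansion, while its membership for "code" stays at zero. This is the kind of pairwise relationship that a plain cosine comparison of raw term vectors does not capture.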
