Granulate and Conquer: Clustering Web Pages Semantically using Combinatorial Topology

Granulation is a natural problem-solving methodology deeply rooted in human thinking; for example, the human body is granulated into head, neck, and so on. The notion is intrinsically fuzzy, vague, and imprecise. Previously, we proposed mathematical models and illustrated the idea in college calculus, fuzzy set theory, and web intelligence. In this paper, we focus on how to use the notion to cluster web pages. A set of documents is a knowledge representation of some human thoughts; these thoughts will be called the Latent Semantic Space. We granulate the Latent Semantic Space into a polyhedron. Because this is a geometric, and hence language-independent, representation, one can identify the semantic equivalence between a set of English documents and, say, its Chinese version without direct translation. A polyhedron is a closed and bounded subset of a Euclidean space that has a combinatorial structure, called a simplicial complex. In this representation, the structure of concepts is captured by the geometry: primitive concepts, concepts, and the idea are represented by sub-polyhedra of simplexes, connected components, and the simplicial complex, respectively. Based on these structures, documents can be clustered into meaningful classes. Applied to the results returned by a current search engine, this method may provide users with more semantically oriented search results.
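
The correspondence between simplexes, connected components, and document clusters can be illustrated with a small sketch. It is not the construction used in the paper, which the abstract does not spell out. It assumes, for illustration only, that a set of terms forms a simplex when all of its terms co-occur in at least `min_support` documents, and that a document belongs to a component when it supports one of that component's maximal simplexes. The function names, the whitespace tokenizer, and the parameters `min_support` and `max_dim` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def frequent_simplexes(docs, min_support=2, max_dim=2):
    """Map each frequent term set (a simplex) to the ids of documents containing it.

    Assumption: a term set is a simplex if its terms co-occur in at least
    `min_support` documents. Every subset of a frequent set is also frequent,
    so the result is closed under taking faces, as a simplicial complex requires.
    """
    doc_sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    simplexes = {}
    for size in range(1, max_dim + 2):  # a simplex of dimension k has k + 1 vertices
        for combo in combinations(vocab, size):
            support = frozenset(
                i for i, terms in enumerate(doc_sets) if set(combo) <= terms
            )
            if len(support) >= min_support:
                simplexes[frozenset(combo)] = support
    return simplexes


def connected_components(simplexes):
    """Group vertices (terms) into connected components with union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s in simplexes:
        for v in s:
            parent.setdefault(v, v)
    for s in simplexes:
        verts = list(s)
        for v in verts[1:]:
            parent[find(v)] = find(verts[0])  # all vertices of a simplex are connected

    groups = defaultdict(set)
    for v in parent:
        groups[find(v)].add(v)
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))


def cluster_documents(docs, min_support=2, max_dim=2):
    """Return (component terms, document ids) pairs, one per connected component."""
    simplexes = frequent_simplexes(docs, min_support, max_dim)
    components = connected_components(simplexes)
    # Maximal simplexes are those not strictly contained in another simplex.
    maximal = [s for s in simplexes if not any(s < t for t in simplexes)]
    clusters = []
    for comp in components:
        members = set()
        for s in maximal:
            if s <= comp:
                members |= simplexes[s]
        clusters.append((comp, frozenset(members)))
    return clusters


docs = [
    "wall street stock market",
    "stock market wall street report",
    "neural network training",
    "training neural network model",
]
clusters = cluster_documents(docs)
for terms, doc_ids in clusters:
    print(sorted(terms), sorted(doc_ids))
```

The brute-force enumeration over term combinations is only practical for a toy vocabulary; a real system would need a frequent-itemset miner to build the complex.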
