Web document clustering based on web log mining

As an increasing number of users access information on the Web, there is a great opportunity to learn from Web server logs in order to cluster large collections of Web documents. One approach is to cluster the documents using only the information in users' usage logs, not the content of the documents. A major advantage of this approach is that relevance is reflected objectively by the usage logs: frequent visits to two seemingly unrelated documents within the same sessions should indicate that they are in fact closely related. Our clustering algorithm, PDBSCAN (Partitioning Based DBSCAN), is based on DBSCAN, a density-based algorithm that has proven capable of processing very large datasets. In addition, we show both analytically and experimentally that our method yields clustering results superior to those of DBSCAN.
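The usage-only idea can be illustrated with a minimal sketch. It is not PDBSCAN: the abstract does not describe the partitioning step, so the code below runs plain DBSCAN over a precomputed distance table. The distance measure (one minus the Jaccard similarity of the sets of sessions that visited each document), the function names `covisit_distance` and `dbscan`, the parameter values, and the sample sessions are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def covisit_distance(sessions):
    """Return (docs, dist); dist[(a, b)] is 1 - Jaccard similarity of the
    sets of sessions that visited a and b (an assumed measure)."""
    visits = defaultdict(set)  # document -> ids of sessions that visited it
    for sid, session_docs in enumerate(sessions):
        for d in session_docs:
            visits[d].add(sid)
    docs = sorted(visits)
    dist = {}
    for a, b in combinations(docs, 2):
        inter = len(visits[a] & visits[b])
        union = len(visits[a] | visits[b])
        dist[(a, b)] = dist[(b, a)] = 1.0 - inter / union
    return docs, dist


def dbscan(docs, dist, eps, min_pts):
    """Plain DBSCAN over a precomputed distance table; label -1 marks noise."""
    def neighbors(p):
        # A point counts as its own neighbor, as in the usual definition.
        return [q for q in docs if q == p or dist[(p, q)] <= eps]

    labels = {}
    cluster = -1
    for p in docs:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # former noise point joins as border
            if q in labels:
                continue
            labels[q] = cluster
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_nbrs)
    return labels


# Hypothetical sessions, each listing the documents one user visited together.
sessions = [
    ["a.html", "b.html"],
    ["a.html", "b.html", "c.html"],
    ["b.html", "c.html"],
    ["x.html", "y.html"],
    ["x.html", "y.html"],
    ["z.html"],
]
docs, dist = covisit_distance(sessions)
labels = dbscan(docs, dist, eps=0.5, min_pts=2)
```

With these sample sessions, `a.html`, `b.html`, and `c.html` fall into one cluster, `x.html` and `y.html` into another, and `z.html` is labeled noise. No document content is consulted at any point; only co-visit information drives the grouping.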