A Graph-Based Framework for Web Document Mining

In this paper we describe methods for performing data mining on web documents whose content is represented by graphs. We show how traditional clustering and classification methods, which usually operate on vector representations of data, can be extended to work with graph-based data. Specifically, we give graph-theoretic extensions of the k-Nearest Neighbors classification algorithm and the k-means clustering algorithm, and show how retaining structural information can lead to improved performance over the vector model approach. We introduce several types of graph-based web document representations and compare their performance for clustering and classification.
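
To make the idea of a graph-theoretic k-Nearest Neighbors classifier concrete, the following is a minimal sketch under stated assumptions, not the method as specified in the paper. It assumes each document becomes a graph with unique node labels (one node per distinct term, a directed edge from each term to the term that follows it) and that distance is derived from the size of the maximum common subgraph, counted here as shared nodes plus shared edges. The function names `doc_to_graph`, `mcs_distance`, and `knn_classify` are hypothetical.

```python
from collections import Counter


def doc_to_graph(text):
    """Assumed representation: nodes are distinct terms, directed edges
    link each term to the term that immediately follows it."""
    words = text.lower().split()
    nodes = set(words)
    edges = set(zip(words, words[1:]))
    return nodes, edges


def mcs_distance(g1, g2):
    """Assumed distance: 1 - |mcs(g1, g2)| / max(|g1|, |g2|).
    With unique node labels, the maximum common subgraph is simply the
    shared nodes and shared edges, so no subgraph search is needed.
    Graph size is counted as nodes plus edges (an illustrative choice)."""
    n1, e1 = g1
    n2, e2 = g2
    common = len(n1 & n2) + len(e1 & e2)
    largest = max(len(n1) + len(e1), len(n2) + len(e2))
    return 1.0 - common / largest if largest else 0.0


def knn_classify(query, training, k=3):
    """training is a list of (graph, label) pairs.
    Returns the majority label among the k graphs nearest to query."""
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        (doc_to_graph("graph matching distance metric"), "graphs"),
        (doc_to_graph("graph distance using subgraph"), "graphs"),
        (doc_to_graph("stock market price report"), "finance"),
        (doc_to_graph("market price falls"), "finance"),
    ]
    query = doc_to_graph("graph distance metric")
    print(knn_classify(query, training, k=3))
```

Unlike a bag-of-words vector, the edge set records which terms appear next to each other, so two documents with the same vocabulary but different word order end up at a nonzero distance. The same distance function could in principle drive a k-means variant by replacing the centroid with a representative graph of each cluster, though that step is not sketched here.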
