Classification of Web Documents Using Graph Matching

In this paper we describe a classification method that operates on graph-based representations of data rather than traditional vector-based representations. We compare the vector approach combined with the k-Nearest Neighbor (k-NN) algorithm to the graph-matching approach on three different web document collections, measuring classification accuracy with the leave-one-out approach. We also compare the performance of different graph distance measures and of various graph-based document representations. The results show that the graph-based approach can outperform traditional vector-based methods in terms of accuracy, dimensionality, and execution time.
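The idea of running k-NN over graphs instead of vectors can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each document graph has uniquely labeled term nodes with directed edges between adjacent terms, so the maximum common subgraph reduces to a set intersection, and it assumes graph size is counted as nodes plus edges. The distance used is one maximum-common-subgraph measure, d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|). All function names and the toy data are hypothetical.

```python
from collections import Counter

def make_graph(tokens):
    """Assumed representation: nodes are unique terms, edges link adjacent terms."""
    nodes = frozenset(tokens)
    edges = frozenset(zip(tokens, tokens[1:]))
    return nodes, edges

def size(graph):
    """Graph size counted as nodes plus edges (an assumption of this sketch)."""
    nodes, edges = graph
    return len(nodes) + len(edges)

def mcs_distance(g1, g2):
    """1 - |mcs| / max(|g1|, |g2|); with unique node labels the mcs is an intersection."""
    common = len(g1[0] & g2[0]) + len(g1[1] & g2[1])
    denom = max(size(g1), size(g2))
    return 1.0 - common / denom if denom else 0.0

def knn_classify(query, training, k=1):
    """Majority vote among the k training graphs closest to the query graph."""
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(dataset, k=1):
    """Classify each graph using all the others as training data."""
    correct = 0
    for i, (graph, label) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]
        if knn_classify(graph, rest, k) == label:
            correct += 1
    return correct / len(dataset)

# Toy data, invented purely for illustration.
toy = [
    (make_graph("graph matching distance graph".split()), "graphs"),
    (make_graph("graph edit distance matching".split()), "graphs"),
    (make_graph("vector space term weight".split()), "vectors"),
    (make_graph("vector term weight model".split()), "vectors"),
]
accuracy = leave_one_out_accuracy(toy, k=1)
```

Because only the distance function touches the graph structure, swapping in a different graph distance measure or a different document-to-graph representation leaves the k-NN and leave-one-out code unchanged.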
