Graph-Theoretic Techniques for Web Content Mining

In this dissertation we introduce several novel techniques for performing data mining on web documents which utilize graph representations of document content. Graphs are more robust than typical vector representations because they can model structural information that is usually lost when the original web document content is converted to a vector representation. For example, we can capture information such as the location, order and proximity of term occurrence, which is discarded under the standard document vector representation models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. Thus many of these methods have not been applied to data represented by graphs, since no suitable graph-theoretical concepts were previously available. We introduce the novel Graph Hierarchy Construction Algorithm (GHCA), which performs topic-oriented hierarchical clustering of web search results modeled using graphs. The system we created around this new algorithm and its prior version is compared with similar web search clustering systems to gauge its usefulness. An important advantage of this approach over conventional web search systems is that the results are better organized and more easily browsed by users. Next we present extensions to classical machine learning algorithms, such as the k-means clustering algorithm and the k-Nearest Neighbors classification algorithm, which allow the use of graphs as fundamental data items instead of vectors. We perform experiments comparing the performance of the new graph-based methods to the traditional vector-based methods for three web document collections. Our experimental results show an improvement for the graph approaches over the vector approaches for both clustering and classification of web documents.
An important advantage of the graph representations we propose is that they allow the computation of graph similarity in polynomial time; with the techniques we use, determining graph similarity is in general an NP-complete problem. In fact, in some cases the execution time of the graph-oriented approach was lower than that of the vector approaches.
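The following Python sketch illustrates how a graph distance can become cheap to compute and then be plugged into k-Nearest Neighbors. It rests on assumptions that the abstract does not state: each distinct term is taken to be a uniquely labeled node, a directed edge links each term to the term that follows it, and distance is derived from the maximum common subgraph, with graph size counted as nodes plus edges. Under unique node labels the maximum common subgraph reduces to set intersections, which take polynomial time. The function names `document_graph`, `mcs_distance`, and `knn_classify` are hypothetical and are not taken from the dissertation.

```python
from collections import Counter


def document_graph(text):
    """Build a (nodes, edges) pair from whitespace-separated terms.

    Assumption: each distinct term is one uniquely labeled node, and a
    directed edge joins every term to the term immediately after it.
    """
    terms = text.lower().split()
    nodes = set(terms)
    edges = set(zip(terms, terms[1:]))
    return nodes, edges


def mcs_distance(g1, g2):
    """Distance in [0, 1] based on the maximum common subgraph (MCS).

    With unique node labels, the MCS is the intersection of the node
    sets together with the intersection of the edge sets, so no
    exponential search is needed. Size = nodes + edges (an assumption).
    """
    (nodes1, edges1), (nodes2, edges2) = g1, g2
    common = len(nodes1 & nodes2) + len(edges1 & edges2)
    largest = max(len(nodes1) + len(edges1), len(nodes2) + len(edges2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - common / largest


def knn_classify(query, labelled_graphs, k=3):
    """Majority vote among the k training graphs closest to the query."""
    nearest = sorted(
        labelled_graphs, key=lambda item: mcs_distance(query, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        (document_graph("graph based web mining"), "mining"),
        (document_graph("web mining with graphs"), "mining"),
        (document_graph("cooking pasta at home"), "food"),
    ]
    query = document_graph("graph based web search")
    print(knn_classify(query, training, k=3))
```

Because the classifier only needs a distance function, the same `mcs_distance` could in principle replace Euclidean distance in other distance-based methods; a graph version of k-means would additionally need a graph-valued replacement for the centroid, which this sketch does not attempt.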
