Structural Analysis of Web-Based Documents

With the rise of web-based communication and given the enormous volumes of data available on the World Wide Web, Web mining is becoming increasingly important. The goal of Web mining is to extract information from web-based data and to analyze such data using data mining methods. The core problem of data mining is the discovery of patterns and structures in large data sets. Web mining is thus a variant of data mining; it can be roughly divided into three areas: Web structure mining, Web content mining, and Web usage mining. The central problem of Web structure mining, which is the main focus of this work, is the investigation of structural properties of web-based documents. As is customary, the Web is treated in this work as a hypertext. In the early phase of hypertext research, graph-based indices were used to measure structural characteristics of hypertexts and to compare their structures. These indices, however, are inadequate for the similarity-based grouping of graph-based hypertext structures. This work therefore concentrates on developing new graph-theoretic and similarity-based methods of analysis. Similarity-based methods that rest on graph-theoretic models can be applied sensibly in the hypertext setting only if they enable meaningful and efficient structural comparisons of graph-based hypertexts. For this reason, this work develops a parametric graph similarity model that has many applications in Web structure mining. Constructing a method for determining the structural similarity of graphs is a central challenge here. Classical methods for determining graph similarity are in most cases based on isomorphism and subgraph isomorphism relations. In contrast, this work develops a method for determining the structural similarity of hierarchized directed graphs that does not build on isomorphism relations. Analyses of web-based document structures are often based on the well-known vector space model. This work instead starts from a graph-based representation model and advances and substantiates the thesis that graph-based representation is a sensible starting point for modeling web-based documents. In an experimental part, the graph similarity measures developed here are successfully evaluated, and the applications resulting from the evaluation are presented.
