WIM: an information mining model for the Web

This paper presents a model for mining information in applications that involve Web and graph analysis, referred to as the WIM (Web information mining) model. We demonstrate the model's characteristics using a Web warehouse. The Web data in the warehouse is modeled as a graph in which nodes represent Web pages and edges represent hyperlinks. In the model, objects are always sets of nodes and belong to one class. Physical objects contain attributes obtained directly from Web pages and links, such as the title of a Web page or the start and end pages of a link. Logical objects are created by performing predefined operations on any existing object. In this paper we present the model components, propose a set of eleven operators, and give examples of views. A view is a sequence of operations on objects, and it represents a way to mine information in the graph. As practical examples, we present views for clustering nodes and for identifying related item sets.
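To make the idea of a view concrete, the following is a minimal sketch in Python, not the paper's implementation. The abstract does not name the eleven operators, so `select_by_title`, `targets_of`, `sources_of`, and `related_pages_view` are hypothetical stand-ins, and the four-page graph is invented for illustration. The sketch assumes only what the abstract states: pages are nodes, hyperlinks are edges with a start and an end page, objects are sets of nodes, and a view is a sequence of operations on objects.

```python
# Hypothetical toy warehouse. Node ids map to a page attribute (the title),
# standing in for a physical object read directly from Web pages.
titles = {"a": "Home", "b": "Products", "c": "Support", "d": "Contact"}

# Each hyperlink is a (start page, end page) pair.
links = {("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")}


def select_by_title(nodes, word):
    """Hypothetical operator: keep the nodes whose title contains `word`."""
    return frozenset(n for n in nodes if word.lower() in titles[n].lower())


def targets_of(nodes):
    """Hypothetical operator: nodes reached by a link starting in `nodes`."""
    return frozenset(end for start, end in links if start in nodes)


def sources_of(nodes):
    """Hypothetical operator: nodes with a link ending in `nodes`."""
    return frozenset(start for start, end in links if end in nodes)


def related_pages_view(word):
    """A view: a fixed sequence of operations, each yielding a new object."""
    seed = select_by_title(frozenset(titles), word)  # logical object 1
    cited = targets_of(seed)                         # logical object 2
    # Pages that link to at least one page the seed links to, minus the seed.
    return sources_of(cited) - seed


print(sorted(related_pages_view("home")))
```

Every intermediate result here is itself a set of nodes, so any operation can be applied to the output of any other. That closure property is what lets a view be written as a plain sequence of steps.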
