Web structure mining: an introduction

As the amount of data available online grows, the World Wide Web has become one of the most valuable resources for information retrieval and knowledge discovery. Web mining technologies are well suited to knowledge discovery on the Web. The knowledge extracted from the Web can be used to improve the performance of Web information retrieval, question answering, and Web-based data warehousing. In this paper, we provide an introduction to Web mining and review the categories of Web mining. We then focus on one of these categories: Web structure mining. Within this category, we introduce link mining and review two popular methods applied in Web structure mining: HITS and PageRank.
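As a rough illustration of what link-based ranking looks like in practice, the following is a minimal sketch of PageRank computed by power iteration. It is not taken from the paper: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the three-page example graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to
    (a hypothetical toy representation of the Web graph).
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_graph)
```

In this toy graph, page C receives links from both A and B and so ends up with the highest score, which reflects the intuition that a page is important when important pages link to it.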
