A Graph based Methodology for Web Structure Mining - with a Case Study on the Webs of UK Universities

Web structure mining extracts knowledge from the hyperlink structure of websites in order to improve web design for clear content presentation and easy navigation. This paper presents a graph-based methodology for web structure mining. The structure of a website is first mapped onto a graph whose nodes represent web pages and whose links represent hyperlinks between pages and to other websites. The characteristics of the web graph, such as the degree of each node, density, connectivity, closeness centralisation, and node clusters, can then be analysed quantitatively. The methodology is tested on web structural data collected from the websites of 110 UK universities. After cleansing and pre-processing the data, graphs were constructed and analysed to obtain the aforementioned properties for each web, along with other useful information such as page size and the length of the optimal path, both of which affect navigability. Based on the evaluation of these properties, guidelines and criteria are devised for classifying the structural quality of the webs into five categories, from very poor to very good. The average degree and the percentage of pages in the strongly connected component (SCC), together with the average distance, were found to be the most important properties in determining the structural quality of a web.
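To make the graph properties named above concrete, the following is a minimal sketch of how they might be computed for a small site. It is not the paper's implementation: the function names, the toy page names, and the specific definitions used here (average out-degree, density of a directed graph without self-links, share of pages in the largest SCC, and mean shortest-path length over reachable ordered pairs) are all assumptions made for illustration.

```python
from collections import deque


def build_graph(edges):
    """Build adjacency sets for a directed graph from (source, target) pairs."""
    graph = {}
    for src, dst in edges:
        if src == dst:
            continue  # assumption: self-links are ignored
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def edge_count(graph):
    return sum(len(targets) for targets in graph.values())


def average_degree(graph):
    """Average out-degree (assumed definition): links per page."""
    return edge_count(graph) / len(graph) if graph else 0.0


def density(graph):
    """Fraction of the n*(n-1) possible directed links that exist."""
    n = len(graph)
    return edge_count(graph) / (n * (n - 1)) if n > 1 else 0.0


def bfs_distances(graph, start):
    """Shortest hop counts from start to every page reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph[page]:
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist


def largest_scc_fraction(graph):
    """Share of pages in the largest strongly connected component.

    Uses forward/backward reachability per page, which is simple but
    quadratic; adequate for a sketch, not for large crawls.
    """
    reverse = {page: set() for page in graph}
    for page, targets in graph.items():
        for nxt in targets:
            reverse[nxt].add(page)
    seen, best = set(), 0
    for page in graph:
        if page in seen:
            continue
        component = set(bfs_distances(graph, page)) & set(bfs_distances(reverse, page))
        seen |= component
        best = max(best, len(component))
    return best / len(graph) if graph else 0.0


def average_distance(graph):
    """Mean shortest-path length over ordered pairs that are reachable."""
    total = pairs = 0
    for page in graph:
        for other, d in bfs_distances(graph, page).items():
            if other != page:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy site: page names are illustrative only.
edges = [
    ("home", "about"),
    ("home", "courses"),
    ("about", "home"),
    ("courses", "home"),
    ("courses", "apply"),
]
site = build_graph(edges)
print(average_degree(site), density(site))
print(largest_scc_fraction(site), average_distance(site))
```

In this toy site the "apply" page has no outgoing links, so it falls outside the SCC and lowers the SCC percentage; this is the kind of dead-end effect that these properties are meant to expose.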
