Surveying the World Wide Web

The World Wide Web (the Web) is the main driving force behind the rapid diffusion of Internet technology. As a result, we are beginning to live a significant part of our lives in Cyberspace. Measuring and monitoring our surroundings is an essential human activity that helps us both to understand and to shape the world we live in. In recent years, substantial effort has been invested in understanding the Internet in general and the Web in particular through, for example, surveys of user attitudes and behaviour, maps of Internet traffic, and indexing of content. Very little research has, however, investigated how to measure and monitor the contents of Web sites using a combination of linguistic and data visualisation measures. Many efforts have demonstrated the use of techniques from within a single discipline, such as information retrieval, data mining, or autonomous agents. This paper instead explores issues related to monitoring the contents of, and changes to, the Web using a range of measures. It aims to demonstrate the principles behind applying semi-automatic measurement instruments to advance our understanding of the Web as a body of textual traces of human activity. The paper suggests five basic types of measures for studying the Web: volume, density, vocabulary, structure, and relative measures. A survey of 82 Swedish Web sites was conducted using semi-autonomous Web robots for information retrieval and filtering, based on techniques from linguistics and information visualisation. Examples demonstrate how such data can be applied to summarise site contents, identify site topics, map site structure, and compare Web sites. The results are discussed and related to emergent issues such as Web navigation, electronic commerce, and the management of knowledge.
