Sampling the Web: The Development of a Custom Search Tool for Research

Research designed to study the Internet is beset with challenges. One of these challenges involves obtaining samples of Web pages. Methodologies used in previous studies may be categorized into random, purposeful, and purposeful random types of sampling. This paper contains an outline of these methodologies and information about the development of a custom sampling tool that may be used to obtain purposeful random samples of Web page links. The custom search application called Web Sampler works through the Google Web APIs service to collect a random sample of pages from search results returned from the Google index. Web Sampler is inexpensive to develop and may be easily customized for specialized search needs required by researchers who are investigating Web page content.

The Internet is a vast network of interconnected computers that supports rapid access to and exchange of digitally encoded information including e-mail and Web pages. Working on top of the Internet is the World Wide Web, which is a hypertext system that enables retrieval and display of Web pages through the use of browser software. During the years since the idea for the Web was initially conceptualized by Berners-Lee (1989, 1990), it has grown to include several billion publicly accessible pages that may be indexed by search engines (Gulli & Signorini, 2005). The Web has become a huge gateway to information that may be accessed by anyone with a computer, Web browser software, and Internet access. The quality and scope of information available on the Web are not fully documented through research due to the continual growth and dynamic nature of Web-based content. Web research comes with its own set of unique methodological challenges such as access issues, population definitions, sampling procedures, and selection or development of appropriate software applications for individual research needs.
The purpose of this paper is to articulate the challenges associated with Web research, summarize several strategies that have been used to obtain samples of Web pages in previous studies, and describe the development of a custom search application designed to obtain purposeful random samples of Web pages.

The Challenges of Web Research

A variety of challenges need to be addressed in any research study designed to explore the Web. The process of finding solutions to these challenges may involve an evolution of traditional research methodologies. Consider the basic research procedure of defining a population and selecting a suitable sample to study. In Web-based research, complex problems emerge when defining populations or selecting samples. For example, if the population is defined to be all existing Web pages, then access problems are encountered. A portion of the total population of all existing Web pages is private, and public access is restricted. If the private pages are eliminated from consideration for research, then the population under study may be redefined as the set of Web pages open to the general public. Samples may be collected from the set of public Web pages without access problems since they are available to everyone. Unfortunately, removing the access barrier does not solve the entire sampling problem. Web page sampling has not proven to be simple or straightforward. Several variations of sampling procedures have been used or proposed in order to extract information from the Web (Henzinger & Lawrence, 2004). As shown in Table 1, an examination of sampling techniques that have been used in Web research reveals that they tend to fall within the broad categories of random sampling, purposive sampling, and purposive random sampling.
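The idea behind purposive random sampling can be illustrated with a minimal Python sketch. It is not the Web Sampler implementation and makes no calls to any search service. Instead, it assumes that a list of search results is already available in memory. The function name `purposive_random_sample`, the `RESULTS` data, and the example.org addresses are hypothetical and used only for illustration. The purposive step narrows the results to pages that match the researcher's topic, and the random step draws a simple random sample, without replacement, from the pages that remain.

```python
import random

# Hypothetical search results standing in for links returned by a search service.
RESULTS = [
    {"title": "Online Math Tutorials", "url": "http://example.org/a"},
    {"title": "Algebra and Math Games", "url": "http://example.org/b"},
    {"title": "Cooking Basics", "url": "http://example.org/c"},
    {"title": "Math Instruction Resources", "url": "http://example.org/d"},
    {"title": "History Projects", "url": "http://example.org/e"},
]


def purposive_random_sample(results, keywords, sample_size, seed=None):
    """Return a random sample of the results whose titles match a keyword."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    # Purposive step: keep only results whose title mentions a keyword.
    lowered = [k.lower() for k in keywords]
    matching = [
        r for r in results
        if any(k in r["title"].lower() for k in lowered)
    ]

    # Random step: sample without replacement, capped at the number of matches.
    k = min(sample_size, len(matching))
    return rng.sample(matching, k)


if __name__ == "__main__":
    for page in purposive_random_sample(RESULTS, ["math"], 2, seed=1):
        print(page["url"])
```

Recording the seed with the other study details allows the same sample to be drawn again from the same list of results. A change in the underlying results would still change the sample.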

[1]  Leanne Bowler,et al.  Using the Web for Canadian history projects: What will children find? , 2004 .

[2]  Edward T. O'Neill,et al.  Web Characterization Project , 2001 .

[3]  Walter R. Borg,et al.  Educational research: An introduction, 6th ed. , 1996 .

[4]  Frank Parry,et al.  The Invisible Web: Uncovering Information Sources Search Engines Can’t See , 2002 .

[5]  Chareen Snelson.  Online Mathematics Instruction: An Analysis of Content , 2002 .

[6]  Basmat Parsad,et al.  Internet Access in U.S. Public Schools and Classrooms: 1994-2003. ED TAB. NCES 2005-015. , 2005 .

[7]  K. Haver,et al.  Cystic fibrosis on the Internet: a survey of site adherence to AMA guidelines. , 2004, Pediatrics.

[8]  John Paul Mueller.  Mining Google Web Services: Building Applications with the Google API , 2004 .

[9]  Ethan Cerami,et al.  Web Services Essentials , 2002 .

[10]  Marc Najork,et al.  Measuring Index Quality Using Random Walks on the Web , 1999, Comput. Networks.

[11]  Antonio Gulli,et al.  The indexable web is more than 11.5 billion pages , 2005, WWW '05.

[12]  Janet Morahan-Martin,et al.  How Internet Users Find, Evaluate, and Use Online Health Information: A Cross-Cultural Review , 2004, Cyberpsychology Behav. Soc. Netw..

[13]  T. D. Wilson.  Review of: Calishain, T. and Dornfest, R. Google hacks: tips and tools for smarter searching. (2nd ed.) Sebastopol, CA: O'Reilly, 2005 , 2005, Inf. Res..

[14]  Edward T. O'Neill,et al.  A Methodology for Sampling the World Wide Web , 2001 .

[15]  Lisa Zhao.  Jump Higher: Analyzing Web-Site Rank in Google , 2004 .

[16]  Laurie Lewis,et al.  Internet Access in U.S. Public Schools and Classrooms: 1994-2005. Highlights. NCES 2007-020. , 2006 .

[17]  M. Patton.  Qualitative evaluation and research methods, 2nd ed. , 1990 .

[18]  Don E. Descy.  Searching the web: From the visible to the invisible , 2004 .

[19]  Tim Berners-Lee,et al.  Information Management: A Proposal , 1990 .

[20]  M. Patton,et al.  Qualitative evaluation and research methods , 1992 .

[21]  Steve Lawrence,et al.  Extracting knowledge from the World Wide Web , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[22]  Alessandro D'Atri,et al.  A quality evaluation methodology of health web-pages for non-professionals , 2004, Medical informatics and the Internet in medicine.

[23]  Meredith D. Gall,et al.  Educational Research: An Introduction , 1965 .

[24]  Robert F. Potter,et al.  Give the People What They Want: A Content Analysis of FM Radio Station Home Pages , 2002 .

[25]  C. Escoffery,et al.  Internet Use for Health Information Among College Students , 2005, Journal of American college health : J of ACH.

[26]  U. Schmidt,et al.  An evaluation of web-based information. , 2004, The International journal of eating disorders.