Extracting accurate and complete results from search engines: Case study Windows Live

Although designed for general Web searching, commercial search engines are also used in Webometrics and related research to produce estimated hit counts or lists of URLs matching a query. Unfortunately, they do not return all matching URLs for a search, and their hit count estimates are unreliable. In this article, we assess whether it is possible to obtain complete lists of matching URLs from Windows Live, and whether any of its hit count estimates are robust. As part of this, we introduce two new methods to extract extra URLs from search engines: automated query splitting and automated domain and TLD searching. Both methods successfully identify additional matching URLs, but the findings suggest that there is no way to get complete lists of matching URLs or accurate hit counts from Windows Live, although some suggestions for estimation are provided. © 2008 Wiley Periodicals, Inc.
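
The abstract names automated query splitting without describing its mechanics. The sketch below shows one plausible reading, and every detail in it is an assumption rather than the article's actual procedure: when a query returns as many URLs as the engine's per-query cap, the query is re-run twice, once with an extra term added and once with that term excluded, and the results are merged. The function `split_query`, the stand-in engine `make_fake_search`, the splitting terms, and the cap of 3 are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set


def split_query(query: str,
                search: Callable[[str], List[str]],
                split_terms: List[str],
                cap: int,
                depth: int = 0) -> Set[str]:
    """Collect URLs for `query`, splitting it whenever the result list hits the cap.

    Assumption: a result list of length `cap` means the engine truncated it.
    """
    urls = set(search(query))
    # Stop if the list was not truncated or no splitting terms remain.
    if len(urls) < cap or depth >= len(split_terms):
        return urls
    term = split_terms[depth]
    # Split into "query term" and "query -term"; together they cover the original.
    for sub_query in (f"{query} {term}", f"{query} -{term}"):
        urls |= split_query(sub_query, search, split_terms, cap, depth + 1)
    return urls


def make_fake_search(corpus: Dict[str, Set[str]], cap: int) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for a search engine that truncates results at `cap`."""
    def search(query: str) -> List[str]:
        tokens = query.split()
        required = {t for t in tokens if not t.startswith("-")}
        excluded = {t[1:] for t in tokens if t.startswith("-")}
        hits = [url for url, words in sorted(corpus.items())
                if required <= words and not (excluded & words)]
        return hits[:cap]
    return search


# Toy corpus: eight pages that all contain "web" plus two distinguishing words.
corpus = {
    f"http://example.org/page{i}": {
        "web",
        "even" if i % 2 == 0 else "odd",
        "low" if i < 4 else "high",
    }
    for i in range(8)
}
search = make_fake_search(corpus, cap=3)

plain = set(search("web"))                                   # truncated at the cap
expanded = split_query("web", search, ["even", "low"], cap=3)
print(len(plain), len(expanded))
```

In this toy setting the single query is truncated, while the split queries together recover every page. Against a real engine the recovered set may still be incomplete, which is consistent with the abstract's conclusion that complete lists cannot be guaranteed.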
