Objectivity, Reliability, and Validity of Search Engine Count Estimates

Count estimates (“hits”) provided by Web search engines have received much attention as a yardstick for measuring phenomena as diverse as language statistics, the popularity of authors, and the similarity between words. Common to these activities is the intention to use Web search engines not only for search but also for ad hoc measurement. Using search engine count estimates (SECEs) in this way means that a phenomenon of interest, e.g., the popularity of an author, is conceived of as a measurand, and SECEs are taken to be its quantitative measures. However, the data quality of SECEs has not yet been studied systematically, and concerns have been raised about the use of this kind of data. This article examines the data quality of SECEs with respect to the classical goodness criteria of objectivity, reliability, and validity. The results of a series of studies indicate that, with the exception of Boolean queries that use disjunction or negation, the objectivity, test-retest reliability, and parallel-test reliability of SECEs are good for most of the browsers and search engines examined. Estimating validity required model development (all-subsets regression) with an explorative approach to feature selection, which yielded satisfactory results. The findings are discussed in the light of previous objections, and perspectives for using Web search count estimates are delineated.
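
As an illustration of the reliability criterion named above, test-retest reliability is commonly expressed as the correlation between two measurements of the same items taken at different times. The following is a minimal sketch under stated assumptions, not the procedure used in the article: the function names, the log transform of the counts, and the example numbers are all hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def retest_reliability(counts_t1, counts_t2):
    """Correlate count estimates for the same queries at two points in time.

    Assumption: counts are log-transformed (log10(count + 1)) because hit
    counts are typically heavily skewed; the article may proceed differently.
    """
    log_t1 = [math.log10(c + 1) for c in counts_t1]
    log_t2 = [math.log10(c + 1) for c in counts_t2]
    return pearson_r(log_t1, log_t2)


if __name__ == "__main__":
    # Made-up count estimates for five queries, obtained on two occasions.
    first_run = [1200000, 45000, 830, 97000000, 15]
    second_run = [1150000, 47000, 790, 99000000, 18]
    print(round(retest_reliability(first_run, second_run), 4))
```

A coefficient close to 1 would indicate that repeated queries return nearly the same ordering and spacing of counts. Parallel-test reliability could be sketched the same way by correlating the counts for two differently worded queries that target the same measurand.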
