We Don't Know What We Don't Know: When and How the Use of Twitter's Public APIs Biases Scientific Inference
Though Twitter research has proliferated, no standards for data collection have crystallized. When using keyword queries, the most common data sources—the Search and Streaming APIs—rarely return the full population of tweets, and scholars do not know whether their data constitute a representative sample. This paper seeks to provide the most comprehensive look to date at the potential biases that may result. Employing data derived from four identical keyword queries to the Firehose (which provides the full population of tweets but is cost-prohibitive), Streaming, and Search APIs, we use Kendall's tau and logit regression analyses to understand the differences in the datasets, including which user and content characteristics make a tweet more or less likely to appear in sampled results. We find that there are indeed systematic differences that are likely to bias scholars' findings in almost all datasets we examine, and we recommend significant caution in future Twitter research.
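The comparison the abstract describes can be illustrated with a minimal sketch: rank the same entities (e.g., hashtags for one keyword query) by frequency in the Firehose population and in a sampled API's results, then measure rank agreement with Kendall's tau. The counts below are invented for illustration, not the paper's data; a tau near 1 means the sample preserves the population's rank ordering, while lower values signal the kind of systematic bias the paper warns about.

```python
# Hedged sketch of the rank-agreement check described in the abstract.
# All counts are hypothetical; only the method (Kendall's tau) is from
# the paper.
from scipy.stats import kendalltau

# Hypothetical per-hashtag tweet counts for one identical keyword query,
# as seen in the full Firehose population vs. a Streaming API sample.
firehose_counts = [5200, 3100, 2900, 1800, 950, 640, 410, 220]
streaming_counts = [510, 330, 270, 190, 80, 70, 45, 30]

tau, p_value = kendalltau(firehose_counts, streaming_counts)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.4f})")
```

In these invented numbers every pairwise ordering is preserved, so tau is exactly 1.0; in practice the paper's point is that sampled APIs often do not preserve such orderings, so tau computed this way quantifies how far a Search or Streaming sample drifts from the Firehose population.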