Beyond traditional survey taking: adapting to a changing world. Big Data as a Data Source for Official Statistics: experiences at Statistics Netherlands

More and more data are being produced by a growing number of electronic devices, both in our physical surroundings and on the internet. The large volume of these data and the high frequency at which they are produced have given rise to the term ‘Big Data’. Because these data reflect many different aspects of our daily lives, and because they are abundant and available, Big Data sources are very interesting from an official statistics point of view. However, first experiences with analyses of large amounts of Dutch traffic loop detection records, call detail records of mobile phones, and Dutch social media messages reveal that a number of challenges must be addressed before these data sources can be applied in official statistics. These challenges, and the lessons learned during these initial studies, are addressed and illustrated with examples. More specifically, the following topics are discussed: the three general types of Big Data discerned, the need to access and analyse large amounts of data, how we deal with noisy data and look at selectivity (and our own bias towards this topic), how to go beyond correlation, how we found people with the right skills and mindset to perform the work, and how we have dealt with privacy and security issues.
