Learning about individuals' health from aggregate data

There is growing awareness that user-generated social media content contains valuable health-related information and is more convenient to collect than typical health data. For example, Twitter has been employed to predict aggregate-level outcomes, such as regional rates of diabetes and child poverty, and to identify individual cases of depression and food poisoning. Models which make aggregate-level inferences can be induced from aggregate data, and consequently are straightforward to build. In contrast, learning models that produce individual-level (IL) predictions, which are more informative, usually requires a large number of difficult-to-acquire labeled IL examples. This paper presents a new machine learning method which achieves the best of both worlds, enabling IL models to be learned from aggregate labels. The algorithm makes predictions by combining unsupervised feature extraction, aggregate-based modeling, and optimal integration of aggregate-level and IL information. Two case studies illustrate how to learn health-relevant IL prediction models using only aggregate labels, and show that these models perform as well as state-of-the-art models trained on hundreds or thousands of labeled individuals.

[1]  Aron Culotta,et al.  Inferring the origin locations of tweets with quantitative confidence , 2013, CSCW.

[2]  Mark Dredze,et al.  How Social Media Will Change Public Health , 2012, IEEE Intelligent Systems.

[3]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Eric Horvitz,et al.  Predicting Depression via Social Media , 2013, ICWSM.

[5]  Aron Culotta,et al.  Estimating county health statistics with twitter , 2014, CHI.

[6]  May D. Wang,et al.  A semi-supervised method for predicting cancer survival using incomplete clinical data , 2015, 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).

[7]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[8]  Gregory J. Park,et al.  Psychological Language on Twitter Predicts County-Level Heart Disease Mortality , 2015, Psychological science.

[9]  Henry A. Kautz,et al.  Predicting Disease Transmission from Geo-Tagged Micro-Blog Data , 2012, AAAI.

[10]  Yoram Bachrach,et al.  Studying User Income through Language, Behaviour and Affect in Social Media , 2015, PloS one.

[11]  Richard Colbaugh,et al.  Proactive defense for evolving cyber threats , 2011, Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics.

[12]  Richard Colbaugh,et al.  Early warning analysis for social diffusion events , 2010, ISI.