Imputing Missing Social Media Data Stream in Multisensor Studies of Human Behavior

The ubiquitous use of social media enables researchers to obtain self-recorded longitudinal data of individuals in real time. Because this data can be collected inexpensively and unobtrusively at scale, social media has been adopted as a “passive sensor” to study human behavior. However, such research is hampered by the lack of homogeneity in the use of social media and by the engineering challenges in obtaining such data. This paper proposes a statistical framework to leverage the potential of social media in sensing studies of human behavior while navigating the challenges associated with its sparsity. Our framework is situated in a large-scale in-situ study concerning the passive assessment of psychological constructs of 757 information workers, in which four sensing streams were deployed: Bluetooth beacons, wearables, smartphones, and social media. Our framework includes principled feature transformation and machine learning models that predict latent social media features from the other passive sensors. We demonstrate the efficacy of this imputation framework via a high correlation of 0.78 between actual and imputed social media features. With the imputed features, we test and validate predictions of psychological constructs such as personality traits and affect. We find that adding the social media data stream, in its imputed form, improves the prediction of these measures. We discuss how our framework can be valuable in multimodal sensing studies that aim to gather comprehensive signals about an individual's state or situation.
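The imputation idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the feature transformation or the models, so PCA and ridge regression are used here as assumed stand-ins, the data is synthetic, and all names (`fit_pca`, `fit_ridge`, `has_social`) are hypothetical. The sketch learns latent social media features from participants who have social media data, predicts those features from the other sensor streams, and scores the imputation by the correlation between actual and imputed values on held-out participants.

```python
import numpy as np

def fit_pca(X, k):
    """Return the column means and the top-k principal axes of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T  # axes have shape (n_features, k)

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

# Synthetic stand-in data (assumption): social media features depend
# linearly on the other sensor features, plus noise.
rng = np.random.default_rng(0)
n, n_sensor, n_social, k = 300, 8, 12, 3
sensors = rng.normal(size=(n, n_sensor))
mixing = rng.normal(size=(n_sensor, n_social))
social = sensors @ mixing + 0.5 * rng.normal(size=(n, n_social))

# Treat the first 200 participants as having social media data; the
# remaining 100 are held out to evaluate the imputation.
has_social = np.arange(n) < 200

# Step 1: transform social media features into k latent features,
# fitting the transformation on observed participants only.
mean, axes = fit_pca(social[has_social], k)
latent = (social - mean) @ axes

# Step 2: learn to predict the latent features from the other sensors.
W, b = fit_ridge(sensors[has_social], latent[has_social])

# Step 3: impute latent features for held-out participants and compare
# them with the actual values via Pearson correlation per latent feature.
imputed = sensors[~has_social] @ W + b
actual = latent[~has_social]
corrs = [np.corrcoef(actual[:, j], imputed[:, j])[0, 1] for j in range(k)]
mean_corr = float(np.mean(corrs))
print(f"mean correlation between actual and imputed features: {mean_corr:.2f}")
```

In a real study the held-out actual values would exist only for participants with social media data, so the same evaluation would be run with cross-validation over that subset rather than a single split.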
