Tweedle: Sensitivity Check in Health-related Social Short Texts based on Regret Theory

Abstract Twitter helps us know what is happening in the world and what people are talking about right now. Every day, millions of Twitterati tweet something personal or impersonal to express their emotions and share valuable knowledge. In the health domain, disclosure of personal health information can have a long-term effect on individuals, either directly or indirectly, which underscores the presence of unrealistic social boundaries and the need for sensitivity analysis in social media. The proposed Tweedle framework was built with 100K tweets extracted using a set of 20 health-related cyber-keywords. The framework combines Regret Theory for tweet annotation, content and contextual feature scores for feature selection, and various machine learning algorithms for sensitivity classification. Annotation in accordance with Regret Theory by domain experts on Amazon Mechanical Turk resulted in 61.5% of the tweets being labeled as sensitive tweets with health data. The context- and content-oriented feature scores introduced as features for classification are the Primary/Secondary tweet score, the Named Entity Recognition score of tweets, Term Frequency-Inverse Document Frequency (TF-IDF), the Cyber-Keyword Ratio in tweets, hashtag mentions, and user mentions. Tweedle evaluated Regret Theory in combination with several classifiers (Support Vector Machine, Naive Bayes, Random Forest, Decision Tree, Logistic Regression, and Recurrent Neural Network + Long Short-Term Memory (RNN + LSTM)) for sensitivity classification of health-domain tweets. The training and testing results showed that RNN + LSTM performed better than the other models at identifying tweets with sensitive health data.
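
The abstract names its content-oriented features without defining them. The following minimal Python sketch shows one way three of them (Cyber-Keyword Ratio, hashtag mentions, user mentions) might be computed for a single tweet. The keyword set, the tokenizer, the function name `tweet_features`, and the definition of the ratio as keyword tokens divided by all tokens are assumptions made for illustration; the abstract does not specify the 20 cyber-keywords or the exact formulas used in Tweedle.

```python
import re

# Hypothetical keyword set: the paper's 20 health-related cyber-keywords
# are not listed in the abstract, so these are placeholders.
CYBER_KEYWORDS = {"cancer", "diabetes", "depression", "surgery"}

# A word, optionally prefixed with '#' (hashtag) or '@' (user mention).
TOKEN_RE = re.compile(r"[#@]?\w+")


def tweet_features(text: str) -> dict:
    """Compute simple content features for one tweet (illustrative only)."""
    tokens = TOKEN_RE.findall(text.lower())
    hashtags = [t for t in tokens if t.startswith("#")]
    mentions = [t for t in tokens if t.startswith("@")]
    # Strip the '#'/'@' prefix so that "#diabetes" matches "diabetes".
    words = [t.lstrip("#@") for t in tokens]
    keyword_hits = sum(1 for w in words if w in CYBER_KEYWORDS)
    # Assumed definition: share of tokens that are cyber-keywords.
    ratio = keyword_hits / len(words) if words else 0.0
    return {
        "cyber_keyword_ratio": ratio,
        "hashtag_count": len(hashtags),
        "mention_count": len(mentions),
    }


if __name__ == "__main__":
    print(tweet_features("My mom starts #diabetes treatment today @DrSmith"))
```

In a full pipeline, features like these would be concatenated with TF-IDF and named-entity scores before being passed to a classifier; that step is omitted here because the abstract gives no details on it.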
