Archetype-Based Modeling and Search of Social Media

Keyword-based search techniques are limited by unknown, mismatched, and obscure vocabulary. These problems are particularly prevalent in social media, where slang, jargon, and memetics are abundant. We develop a new technique, Archetype-Based Modeling and Search, that can mitigate these problems. The technique learns to identify new relevant documents from a specified set of archetypes, from which both vocabulary and relevance information are extracted. We present a case study on Reddit data, using authors from /r/Opiates to characterize discourse around opioid use and to find additional relevant authors on this topic.
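The general idea of ranking new documents against a set of archetypes can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify. It assumes a simple bag-of-words model in which the vocabulary is taken from the archetype documents, relevance is approximated by the archetypes' mean term-count vector, and candidates are scored by cosine similarity. All function names and the toy strings are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real text needs more care)."""
    return text.lower().split()


def build_vocabulary(archetypes):
    """Vocabulary = every token that appears in at least one archetype."""
    return sorted({tok for doc in archetypes for tok in tokenize(doc)})


def vectorize(text, vocab):
    """Term-count vector of `text` restricted to the archetype vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(archetypes, candidates):
    """Score each candidate against the archetype centroid, best first."""
    vocab = build_vocabulary(archetypes)
    arch_vecs = [vectorize(doc, vocab) for doc in archetypes]
    # Mean term-count vector stands in for "relevance information" here.
    centroid = [sum(col) / len(arch_vecs) for col in zip(*arch_vecs)]
    scored = [(cosine(vectorize(c, vocab), centroid), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Toy, made-up strings purely for illustration.
    archetypes = ["tapering schedule questions", "questions about tapering safely"]
    candidates = ["new graphics card benchmarks", "is my tapering schedule safe"]
    for score, text in rank_candidates(archetypes, candidates):
        print(f"{score:.3f}  {text}")
```

Because the vocabulary comes from the archetypes rather than from a user-supplied query, terms the searcher would not have known to type can still contribute to a candidate's score. That is the property the abstract emphasizes.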
