Inferring Individual Attributes from Search Engine Queries and Auxiliary Information

Internet data has emerged as a primary source for investigating different aspects of human behavior. A crucial step in such studies is finding a suitable cohort (i.e., a set of users) that shares a common trait of interest to researchers. However, direct identification of users sharing this trait is often impossible, as the data available to researchers is usually anonymized to preserve user privacy. To facilitate research on specific topics of interest, especially in medicine, we introduce an algorithm for identifying a trait of interest in anonymous users. We illustrate how a small set of labeled examples, together with statistical information about the entire population, can be combined to obtain labels for unseen examples. We validate our approach using labeled data from the political domain. We provide two applications of the proposed algorithm to the medical domain. In the first, we demonstrate how to identify users whose search patterns indicate they might be suffering from certain types of cancer. This shows, for the first time, that search queries can be used as a screening device for diseases that are currently often discovered too late because no early screening tests exist. In the second, we detail an algorithm to predict the distribution of diseases given their incidence in a subset of the population under study, making it possible to predict disease spread from partial epidemiological data.
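
The general idea of combining a few labeled examples with population-level statistics can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details are not given in the abstract; it is a generic learning-from-label-proportions setup, assuming a logistic model, a squared penalty on the gap between each group's mean predicted probability and its known proportion, and entirely hypothetical data and names (`fit`, `demo`, `region_a`, `region_b`).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    # Predicted probability that a user with features x has the trait.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def fit(labeled, groups, proportions, dim, lam=1.0, lr=0.5, steps=500):
    """Gradient descent on log-loss over labeled users plus a proportion penalty.

    labeled:     list of (features, label) pairs, label in {0, 1}
    groups:      dict mapping group name -> list of unlabeled feature vectors
    proportions: dict mapping group name -> known fraction of positives
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        # Gradient of the mean log-loss on the small labeled set.
        for x, y in labeled:
            p = predict(w, x)
            for j in range(dim):
                grad[j] += (p - y) * x[j] / len(labeled)
        # Gradient of lam * (mean prediction in group - known proportion)^2.
        for g, xs in groups.items():
            ps = [predict(w, x) for x in xs]
            gap = sum(ps) / len(ps) - proportions[g]
            for x, p in zip(xs, ps):
                for j in range(dim):
                    grad[j] += lam * 2.0 * gap * p * (1.0 - p) * x[j] / len(xs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical data: features are (score, bias); the hidden trait is score > 0.
    groups = {
        "region_a": [(rng.gauss(1.0, 1.0), 1.0) for _ in range(200)],
        "region_b": [(rng.gauss(-1.0, 1.0), 1.0) for _ in range(200)],
    }

    def truth(x):
        return 1 if x[0] > 0 else 0

    # Aggregate statistics stand in for published population-level figures.
    proportions = {g: sum(truth(x) for x in xs) / len(xs) for g, xs in groups.items()}
    # Only four users carry individual labels.
    labeled = [((2.0, 1.0), 1), ((1.5, 1.0), 1), ((-2.0, 1.0), 0), ((-1.5, 1.0), 0)]
    w = fit(labeled, groups, proportions, dim=2)
    unlabeled = [x for xs in groups.values() for x in xs]
    correct = sum((predict(w, x) > 0.5) == (truth(x) == 1) for x in unlabeled)
    return w, correct / len(unlabeled)


if __name__ == "__main__":
    weights, accuracy = demo()
    print("weights:", weights, "accuracy on unlabeled users:", accuracy)
```

In this toy setting the proportion penalty acts as weak supervision for the 400 unlabeled users, while the four labeled users fix the direction of the decision boundary. The penalty weight `lam` is an assumed tuning parameter.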
