Social media sites are a major source of non-curated, user-generated feedback on virtually all products and services. Users increasingly rely on social media to disclose real-life incidents, some of them serious, rather than using official communication channels. If extracted reliably and robustly, this actionable, user-generated information has the potential to benefit critical applications in public health and safety, and beyond. Unfortunately, existing technology does not adequately support the extraction and presentation of actionable information from social media, where the output of the extraction process is used to take concrete actions in the real world. Traditional information extraction approaches do not work well over the highly informal, noisy, and ungrammatical text common in social media, and they do not handle the extraction and aggregation of the rare content that important applications need to extract from high-volume streaming sources. In our ongoing collaborative project between Columbia University and the New York City Department of Health and Mental Hygiene (DOHMH), we aim to address these gaps in research and technology for one important public health application, namely, detecting and acting on foodborne outbreaks in New York City restaurants. Thus far, we have addressed these issues successfully and have used one social media site to identify and follow up on several foodborne outbreaks that had not been reported through conventional channels.

1. DETECTING AND ACTING ON FOODBORNE OUTBREAKS

The Centers for Disease Control and Prevention (CDC) estimates that 1 in 6 Americans, or 48 million people, get sick from a foodborne disease each year. Of confirmed foodborne outbreaks investigated nationally, 45% are restaurant-related [20].
The New York City DOHMH is the agency with primary responsibility for foodborne disease detection and outbreak investigations in New York City. New York City hosts approximately 24,000 restaurants and 15,000 food retail establishments (e.g., grocery stores, delis). Citizens can report illnesses associated with such venues through the official 311 telephone line and its associated website and app. Once an outbreak is identified, DOHMH launches an investigation that includes inspecting the restaurant, testing food and clinical specimens, collecting symptom and food exposure data from restaurant patrons, and conducting statistical analysis to implicate a food item. Finally, DOHMH takes action to stop the outbreak (e.g., by removing a contaminated food item) and prevent its recurrence (e.g., through education). Unfortunately, only about 30 restaurant-associated foodborne outbreaks are reported annually, so many such outbreaks almost certainly go unreported to the government through the official channels, with potentially serious public health repercussions. In fact, investigated outbreaks are thought to represent only a small fraction of all foodborne outbreaks [6].

KDD at Bloomberg: The Data Frameworks Track (KDD 2014), New York, New York, USA.
As we reported in [10], in a study of 294,000 New York City restaurant reviews on the Yelp website, we discovered that only 3% of the illnesses that we identified had been reported through the official New York City channels. This finding highlights the importance of extracting this valuable, otherwise-unreported outbreak-related information from social media, so that the government can analyze it and launch investigations when appropriate. Figure 1(a) shows Yelp and Twitter posts about a potential food poisoning incident at a New York City restaurant.

Since January 2012, the Computer Science Department at Columbia University and the New York City DOHMH have maintained a fruitful, ongoing collaboration that combines Columbia’s expertise in Computer Science with the DOHMH’s domain knowledge and infrastructure for the application at hand. Overall, the goal of this collaboration is to identify and analyze the unprecedented volumes of user-contributed opinions and comments about restaurants on social media sites in order to extract reliable indicators of disease outbreaks associated with those restaurants, an important public health task. We have already produced a proof-of-concept, operational prototype over Yelp data. (Yelp has graciously been providing feeds, now daily, of New York City restaurant reviews for use in our project.) The DOHMH has used our prototype since July 2012. Our system processes each Yelp review with multiple classifiers, developed through supervised machine learning, to detect (1) whether the review discusses a potential food poisoning incident; (2) whether the review hints at an incu-
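
To make the first criterion concrete, the following is a minimal sketch of a supervised text classifier that flags reviews discussing a potential food poisoning incident. It is an illustration under stated assumptions, not the prototype's implementation: the text does not name the learning algorithm or features, so a bag-of-words multinomial Naive Bayes model with add-one smoothing is assumed here. The names `tokenize`, `NaiveBayesReviewClassifier`, and `TRAIN`, and the handful of training reviews, are all hypothetical and invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a review and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesReviewClassifier:
    """Multinomial Naive Bayes over bag-of-words counts (assumed model)."""

    def fit(self, reviews, labels):
        # Number of training reviews per label, used for the class priors.
        self.class_counts = Counter(labels)
        # Per-label word frequencies, used for the word likelihoods.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(reviews, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return the unnormalized log-posterior of each label for a review."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training set: True means the review discusses a
# potential food poisoning incident. Real training data would be
# manually labeled reviews, far larger than this.
TRAIN = [
    ("I got food poisoning after eating the chicken here", True),
    ("We were sick and vomiting all night after dinner", True),
    ("My stomach hurt and I had diarrhea the next day", True),
    ("Great pizza and friendly service", False),
    ("The pasta was delicious and the staff were lovely", False),
    ("Nice atmosphere, great cocktails, will come back", False),
]

classifier = NaiveBayesReviewClassifier().fit(
    [text for text, _ in TRAIN],
    [label for _, label in TRAIN],
)

print(classifier.predict("vomiting and diarrhea after the chicken"))
print(classifier.predict("delicious pizza and great service"))
```

In a setting like the one described, each further criterion would plausibly get its own classifier trained on its own labels, and a review flagged by this sketch would be only a candidate for human follow-up rather than a confirmed case.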
[1] J. Brownstein et al. Digital disease detection--harnessing the Web for public health surveillance. The New England Journal of Medicine, 2009.
[2] Henry A. Kautz et al. nEmesis: Which Restaurants Should You Avoid Today? HCOMP, 2013.
[3] Jure Leskovec et al. The bursty dynamics of the Twitter information network. WWW, 2014.
[4] Mark Dredze et al. How Social Media Will Change Public Health. IEEE Intelligent Systems, 2012.
[5] Mark Dredze et al. Separating Fact from Fear: Tracking Flu Infections on Twitter. NAACL, 2013.
[6] Hila Becker et al. Hip and trendy: Characterizing emerging trends on Twitter. J. Assoc. Inf. Sci. Technol., 2011.
[7] Hila Becker et al. Selecting Quality Twitter Content for Events. ICWSM, 2011.
[8] R. Tauxe et al. Foodborne illness acquired in the United States--unspecified agents. Emerging Infectious Diseases, 2011.
[9] Prasenjit Mitra et al. Temporal and Information Flow Based Event Detection from Social Text Streams. AAAI, 2007.
[10] Luis Gravano et al. Using Online Reviews by Restaurant Patrons to Identify Unreported Cases of Foodborne Illness - New York City, 2012–2013. MMWR. Morbidity and Mortality Weekly Report, 2014.
[11] Yejin Choi et al. Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews. EMNLP, 2013.
[12] Marcello Pagano et al. Using temporal context to improve biosurveillance. Proceedings of the National Academy of Sciences of the United States of America, 2003.
[13] Fotis Psallidas et al. Effective Event Identification in Social Media. IEEE Data Eng. Bull., 2013.
[14] Mark Dredze et al. You Are What You Tweet: Analyzing Twitter for Public Health. ICWSM, 2011.
[15] Hila Becker et al. Identifying content for planned events across social media sites. WSDM '12, 2012.
[16] Matthew Mohebbi et al. Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic. PLoS ONE, 2011.
[17] Miles Osborne et al. Streaming First Story Detection with application to Twitter. NAACL, 2010.
[18] Hila Becker et al. Beyond Trending Topics: Real-World Event Identification on Twitter. ICWSM, 2011.
[19] Jing Jiang et al. A Unified Model for Topics, Events and Users on Twitter. EMNLP, 2013.
[20] Jeremy Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature, 2009.