DiTeX: Disease-related topic extraction system through internet-based sources

This paper describes DiTeX, a web-based automated disease-related topic extraction system that monitors important disease-related topics and provides associated information. National disease surveillance systems require a considerable amount of time to inform people of recent disease outbreaks. To solve this problem, many studies have used Internet-based sources such as news and social network services (SNS). However, these sources contain many intentional elements that interfere with the extraction of important topics. To address this challenge, we combine natural language processing with an effective ranking algorithm to develop DiTeX, which provides important disease-related topics. This paper describes the web front-end and back-end architecture of DiTeX, its implementation, the performance of its ranking algorithm, and the topics it captured. We describe the processes for collecting Internet-based data and extracting disease-related topics based on search keywords. Our system then applies a ranking algorithm to evaluate the importance of the disease-related topics extracted from these data. Finally, we evaluate the performance and effectiveness of DiTeX on real-world incidents by comparing how several ranking algorithms rank well-known disease-related incidents. The topic extraction rate of our ranking algorithm is higher than that of the other algorithms compared. We demonstrate the validity of DiTeX by summarizing the disease-related topics that our system extracted for each day. To our knowledge, DiTeX is the world's first automated web-based real-time service system that extracts and presents disease-related topics, trends, and related data from web-based sources. DiTeX is available on the web at http://epidemic.co.kr/media/topics.
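The abstract does not specify the ranking algorithm, so the following is only a minimal sketch of the general idea of keyword-based topic extraction followed by ranking. It uses a plain TF-IDF-style score as a stand-in, not the DiTeX algorithm. The function names, the stop-word list, and the sample headlines are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and sample headlines; not DiTeX data.
STOPWORDS = {"in", "to", "the", "a", "of"}
SAMPLE_DOCS = [
    "measles outbreak reported in school",
    "measles outbreak spreads to second school",
    "stock market closes higher",
    "new phone released today",
    "weather forecast predicts rain",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_topics(documents, keyword, top_n=3):
    """Rank terms that co-occur with `keyword` by a TF-IDF-style score.

    Term frequency is counted only in documents containing the keyword;
    document frequency is counted over the whole collection.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency of every term across all documents.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Term frequency restricted to keyword-matching documents.
    term_freq = Counter()
    for tokens in tokenized:
        if keyword in tokens:
            term_freq.update(
                t for t in tokens if t != keyword and t not in STOPWORDS
            )

    scores = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    for term, score in rank_topics(SAMPLE_DOCS, "measles"):
        print(f"{term}\t{score:.3f}")
```

Terms that appear often alongside the search keyword but rarely elsewhere score highest, which is one simple way to surface candidate topics. A production system would need more robust tokenization and filtering of noisy sources.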
