Are raw RSS feeds suitable for broad issue scanning? A science concern case study

Broad issue scanning is the task of identifying important public debates arising within a given broad issue; Really Simple Syndication (RSS) feeds are a natural information source for investigating broad issues. RSS, as originally conceived, is a method for publishing timely and concise information on the Internet, for example, about the main stories on a news site or the latest postings in a blog. RSS feeds are potentially a nonintrusive source of high-quality data about public opinion: monitoring a large number of feeds may allow quantitative methods to extract information relevant to a given need. In this article we describe an RSS feed-based coword frequency method to identify bursts of discussion relevant to a given broad issue. A case study of public science concerns is used to demonstrate the method and to assess the suitability of raw RSS feeds (i.e., feeds used without data cleansing) for broad issue scanning. However, an attempt to identify genuine science concern debates in the corpus by investigating the top 1,000 “burst” words found only two. The low success rate was mainly caused by a few pathological feeds that dominated the results and obscured any significant debates. The results point to the need to develop effective data cleansing procedures for RSS feeds, particularly when there is not a large quantity of discussion about the broad issue, and a range of potential techniques is suggested. Finally, the analysis confirmed that the time series information generated by real-time monitoring of RSS feeds could usefully illustrate the evolution of new debates relevant to a broad issue. © 2006 Wiley Periodicals, Inc.
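The abstract does not spell out the burst calculation, so the following is only a minimal sketch of one way a coword burst score could be computed, not the authors' procedure. It assumes feed items have already been reduced to `(day_index, text)` pairs, treats an item as relevant when it contains any of a set of issue terms, and scores each co-occurring word by its largest rise in daily proportion over the mean of earlier days. The function name `burst_scores`, the input format, the whitespace tokenization, and the scoring rule are all illustrative assumptions.

```python
from collections import defaultdict


def burst_scores(items, issue_terms):
    """Score words that co-occur with issue terms by their largest daily rise.

    items: iterable of (day_index, text) pairs (hypothetical input format).
    issue_terms: words that mark an item as relevant to the broad issue.
    Returns {word: (score, day)}; day is None if the word never rises.
    """
    issue_terms = {t.lower() for t in issue_terms}
    totals = defaultdict(int)                       # day -> relevant items
    counts = defaultdict(lambda: defaultdict(int))  # word -> day -> items with word

    for day, text in items:
        words = set(text.lower().split())  # naive tokenization (assumption)
        if not words & issue_terms:
            continue  # item does not mention the broad issue
        totals[day] += 1
        for word in words - issue_terms:
            counts[word][day] += 1

    days = sorted(totals)
    scores = {}
    for word, per_day in counts.items():
        best = (0.0, None)
        earlier = []  # proportions observed on earlier days
        for day in days:
            proportion = per_day.get(day, 0) / totals[day]
            if earlier:
                rise = proportion - sum(earlier) / len(earlier)
                if rise > best[0]:
                    best = (rise, day)
            earlier.append(proportion)
        scores[word] = best
    return scores
```

Ranking the returned words by score gives a candidate list of burst words for manual inspection. Because every item counts equally in this sketch, a single high-volume feed can dominate the ranking, which is the kind of distortion the abstract attributes to pathological feeds.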
