Improving Categorisation in Social Media Using Hyperlinks to Structured Data Sources

Social media presents unique challenges for topic classification, including the brevity of posts, the informal nature of conversations, and the frequent reliance on external hyperlinks to give context to a conversation. In this paper we investigate the usefulness of these external hyperlinks for categorising the topic of individual posts. We focus our analysis on hyperlinked objects that have related metadata available on the Web, either via APIs or as Linked Data. Our experiments show that including metadata from hyperlinked objects in addition to the original post content significantly improves classifier performance on two disparate datasets. Including selected metadata from APIs and Linked Data gives better results than including text from HTML pages. We also investigate how this improvement varies across topics, and we use the structure of the data to compare the usefulness of different types of external metadata for topic classification in a social media dataset.
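The idea of combining post content with metadata from hyperlinked objects can be illustrated with a minimal sketch. It is not the experimental setup of the paper: the field names (`title`, `category`), the toy posts, and the small Naive Bayes classifier are all assumptions chosen only to show how selected metadata values might be appended to a post before classification.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_document(post_text, metadata=None, fields=None):
    """Append the values of selected metadata fields to the post text.

    The field names are hypothetical; real APIs or Linked Data sources
    expose their own properties.
    """
    parts = [post_text]
    for name in fields or []:
        value = (metadata or {}).get(name)
        if value:
            parts.append(value)
    return " ".join(parts)


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, documents, labels):
        for doc, label in zip(documents, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, document):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(document) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data (invented for illustration): post text, metadata of the linked object, topic.
FIELDS = ["title", "category"]
train = [
    ("check this out", {"title": "Guitar chords for beginners", "category": "Music"}, "music"),
    ("what do you think?", {"title": "Champions League final highlights", "category": "Sports"}, "sport"),
    ("love this band, new album is great", {}, "music"),
    ("great match last night, what a goal", {}, "sport"),
]
docs = [build_document(text, meta, FIELDS) for text, meta, _ in train]
labels = [label for _, _, label in train]
model = NaiveBayes().fit(docs, labels)

new_post = build_document(
    "have a look at this",
    {"title": "Live album recording", "category": "Music"},
    FIELDS,
)
predicted = model.predict(new_post)
print(predicted)
```

Restricting `FIELDS` to particular properties is one simple way to mimic the comparison between different types of external metadata; which fields help most is an empirical question that this toy example does not answer.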
