SABINE: A Multi-purpose Dataset of Semantically-Annotated Social Content

Social Business Intelligence (SBI) is the discipline that combines corporate data with social content to let decision makers analyze the trends perceived from the environment. SBI poses research challenges in several areas, such as IR, data mining, and NLP; unfortunately, SBI research is often held back by the lack of publicly available, real-world data for experimenting with approaches, and by the difficulty of determining a ground truth. To fill this gap we present SABINE, a modular dataset in the domain of European politics. SABINE includes 6 million bilingual clips crawled from 50 000 web sources, each associated with metadata and sentiment scores; an ontology with 400 topics, their occurrences in the clips, and their mapping to DBpedia; and two multidimensional cubes for analyzing and aggregating sentiment and semantic occurrences. We also propose a set of research challenges that can be addressed using SABINE; notably, its expert-validated ground truth makes it possible to test approaches to the whole SBI process as well as to each individual task.
