Of Cubes, DAGs and Hierarchical Correlations: A Novel Conceptual Model for Analyzing Social Media Data

With the advent of social media, there is an ever-increasing amount of unstructured data that can be analyzed to obtain insights. Two prominent examples are sentiment analysis and the discovery of correlated concepts. A convenient representation of information in such scenarios is in terms of concepts extracted from the unstructured data, and measures, such as sentiment scores, associated with these concepts. Typically, social media analysis reports these concepts and their associated measures. We argue that much richer insights can be obtained through the use of OLAP-style multidimensional analysis. It is fairly straightforward to add traditional dimension hierarchies such as time and geography, and to analyze the data along these dimensions using traditional OLAP operations such as roll-up; for instance, to answer queries of the form "What was the average sentiment for X in Europe during the past month?" However, it is trickier to answer queries of the form "What was the average sentiment for concepts related to X in Europe during the past month?" We introduce a conceptual modeling framework that extends traditional multidimensional models and OLAP operators to address the new set of requirements for data extracted from social media. In this model, we organize data along both traditional dimensions (we call these metadata dimensions) and concept dimensions, which model relationships among concepts using parent-child hierarchies. Specifically: (i) we allow operations on parent-child hierarchies to be treated uniformly with operations on traditional dimension hierarchies; (ii) to model the rich relationships that can exist among concepts, we extend the parent-child hierarchies to be rooted level-DAGs rather than simply trees; and (iii) we introduce new equivalence classes that allow us to reason with "similar" concepts in new ways.
We show that our modeling and operator framework facilitates multidimensional analysis that yields richer insights from social media data than is possible with existing methods.
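To make the idea of rolling up a measure along a concept dimension concrete, the following is a minimal sketch, not the operators defined in the paper. It assumes a tiny hypothetical concept DAG (`PARENTS`), hypothetical fact rows (`FACTS`), and a simple averaging rule in which each fact contributes once to every ancestor concept, even when that ancestor is reachable along several paths. All names and values are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical concept DAG: child -> list of parents.
# "camera" has two parents, so the hierarchy is a DAG and not a tree.
PARENTS = {
    "phone": ["electronics"],
    "camera": ["electronics", "photography"],
    "electronics": ["all"],
    "photography": ["all"],
}

# Hypothetical fact rows: (concept, region, sentiment score).
FACTS = [
    ("phone", "Europe", 0.8),
    ("camera", "Europe", 0.2),
    ("camera", "Asia", 0.6),
]


def ancestors_or_self(concept):
    """Return the concept plus every concept reachable via parent edges."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def rollup_average(facts, region):
    """Average sentiment per concept for one region.

    Each fact is counted once per ancestor (assumed rule), so a concept
    reachable along two paths does not double-count the same fact.
    """
    totals = defaultdict(lambda: [0.0, 0])  # concept -> [sum, count]
    for concept, fact_region, score in facts:
        if fact_region != region:
            continue
        for node in ancestors_or_self(concept):
            totals[node][0] += score
            totals[node][1] += 1
    return {node: total / count for node, (total, count) in totals.items()}


europe = rollup_average(FACTS, "Europe")
```

Here `europe["electronics"]` averages the Europe facts for both `phone` and `camera`, which corresponds to a query of the form "average sentiment for concepts related to X in Europe". The set-based ancestor traversal is one possible way to avoid double counting in a DAG; other aggregation semantics are equally plausible.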
