Analyzing data at massive scale is one of the biggest challenges Last.fm
faces. Interpreting patterns in user behaviour is difficult when
millions of users interact in billions of combinations; the data sets must be
analyzed, summarized and presented visually.
This thesis describes a data store for multidimensional time-series
data, in which measurements are summarized across multiple dimensions. The data
store is optimized for retrieval speed: one of the design goals is to serve
data at mouse-click rate to promote real-time data exploration.
Similar data stores exist, but they generally use relational database systems
as their backing store. The novelty of our approach is to model multidimensional
data cubes on top of a distributed, column-oriented database in order to
reap the scalability benefits of such databases.
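To make the idea of a data cube over a column-oriented layout more concrete, the following is a minimal sketch under stated assumptions, not the design described in this thesis. The dimension names, the "*" wildcard, the "/" key separator and the sample events are all hypothetical, and a nested in-memory dictionary stands in for the distributed database: each row key names one cube cell and each column holds one time bucket.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimension names; assumed for illustration only.
DIMENSIONS = ("country", "age_group")
ANY = "*"  # wildcard marking a dimension that has been summarized away


def cell_keys(event):
    """Yield one row key per subset of dimensions kept at full detail."""
    for r in range(len(DIMENSIONS) + 1):
        for kept in combinations(DIMENSIONS, r):
            yield "/".join(
                str(event[d]) if d in kept else ANY for d in DIMENSIONS
            )


def build_cube(events):
    """Aggregate counts into {row_key: {time_bucket: total}}.

    The nested dict stands in for a column-oriented table: each row key
    names a cube cell and each column holds one time bucket.
    """
    cube = defaultdict(lambda: defaultdict(int))
    for event in events:
        for key in cell_keys(event):
            cube[key][event["day"]] += event["count"]
    return cube


# Made-up sample measurements.
events = [
    {"day": "2009-01-01", "country": "SE", "age_group": "18-24", "count": 3},
    {"day": "2009-01-01", "country": "SE", "age_group": "25-34", "count": 2},
    {"day": "2009-01-02", "country": "UK", "age_group": "18-24", "count": 5},
]
cube = build_cube(events)
```

In this sketch every summary is computed when the data is written, so reading the time series for, say, all Swedish users is a single lookup of `cube["SE/*"]`. That trade of extra storage for cheap reads is one plausible way to reach mouse-click response times.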