Taxonomy and Review of Big Data Solutions Navigation

In recent years, we have heard more and more often about Big Data and the issues related to managing these huge volumes of data, for which there is still no definitive solution for storage, querying, and analysis. The discipline is evolving rapidly, is multidisciplinary, and is very complex to master because of the many aspects to be addressed: architecture, data structure, data management, data analytics, protection and security, computation in the form of parallel and distributed processing, etc. To cope with these problems, a basic survey of the main guidelines could be of great help, allowing developers to orient themselves in choosing the best solution for developing a Big Data management architecture in each specific case. In fact, a large number of fields, from industry to scientific research, are seeking information and hints about what can be obtained through the analysis of Big Data, how the related infrastructures can be created, which tools to use, etc. More powerful solutions are therefore needed to cope with the increasing complexity and variety of problems, and to better manage and exploit open, accessible, produced, and integrated data. In this paper, starting from the analysis of existing solutions and of particularly interesting and well-documented use cases, we identified a group of main differentiating features that can most influence the choice of the solution to be set up; we then looked at different types of existing solutions and products to see how they handle the identified features. Lastly, we present the results obtained by analyzing the desirable features for each main domain and by identifying the most suitable and/or adopted products for the application domains. The work cannot be exhaustive, and in many cases we had to decide which aspects and tools to include and which to exclude. The obtained results can be regarded as a model and as main guidelines for navigating Big Data solutions.
