Hybrid storage architecture and efficient MapReduce processing for unstructured data

Abstract As we enter the era of data deluge, efficiently managing massive data is becoming a great challenge, especially for unstructured data, which is growing exponentially and far exceeds structured and semi-structured data in volume. Unstructured data is also more complex because of its variety: different types of unstructured data differ in file size, type, and usage, and therefore need different storage and processing to achieve high efficiency. In this paper, we propose a hybrid storage architecture to store pervasive unstructured data. This architecture integrates various kinds of data stores within a unified framework, in which each type of unstructured data is assigned a suitable placement policy, transparently to users. In addition, we present several partitioning strategies based on the unified framework that benefit MapReduce-based batch processing of unstructured data. The experiments demonstrate that an efficient and smart system can be built with the hybrid architecture and the partitioning strategies.
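To make the idea of a transparent placement policy concrete, the following is a minimal sketch rather than the system described in the paper. The abstract does not name the underlying stores, the size thresholds, or the partitioning rules, so the store names (`key_value_store`, `distributed_file_system`, `object_store`), the 1 MiB cutoff, and the functions `choose_store` and `partition_by_store` are all hypothetical. The sketch only illustrates routing each file to a store by its size and type, then grouping files by store so that a batch job could process each group separately.

```python
from dataclasses import dataclass

# Assumed cutoff for "small" files; the abstract gives no concrete value.
SMALL_FILE_BYTES = 1 * 1024 * 1024


@dataclass(frozen=True)
class FileInfo:
    name: str
    size_bytes: int
    media_type: str  # e.g. "text", "image", "video"


def choose_store(info: FileInfo) -> str:
    """Pick a (hypothetical) backing store from file size and type."""
    if info.size_bytes < SMALL_FILE_BYTES:
        # Many small files are assumed to fit a key-value store best.
        return "key_value_store"
    if info.media_type == "text":
        # Large text is assumed to be scanned by batch jobs.
        return "distributed_file_system"
    # Everything else (large binary media) goes to an object store.
    return "object_store"


def partition_by_store(files):
    """Group file names by the store chosen for them."""
    groups = {}
    for f in files:
        groups.setdefault(choose_store(f), []).append(f.name)
    return groups


files = [
    FileInfo("a.txt", 10, "text"),
    FileInfo("b.log", 100 * 1024 * 1024, "text"),
    FileInfo("c.mp4", 200 * 1024 * 1024, "video"),
]
groups = partition_by_store(files)
```

A caller only hands over `FileInfo` records and never names a store, which is the sense in which the placement is transparent in this sketch.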
