D-Ocean: an unstructured data management system for data ocean environment

Along with the big data movement, many organizations collect their own big data and build distinctive applications. To provide smart services on top of big data, massive and varied data should be well linked and organized to form a Data Ocean, a notion that emphasizes deep exploration of the relationships among unstructured data. Currently, almost all of these applications have to handle unstructured data at the application level, integrating various analysis and search techniques on top of massive storage and processing infrastructure, which greatly increases the difficulty and cost of application development. This paper presents D-Ocean, an unstructured data management system for the data ocean environment. D-Ocean has an open and scalable architecture consisting of a core platform, pluggable components, and auxiliary tools. It uses a unified storage framework to store data in different kinds of data stores, integrates batch and incremental processing mechanisms to process unstructured data, and provides a combined search engine to conduct compound queries. Furthermore, a process modeling approach called RAISE is proposed to support the whole process of Repository, Analysis, Index, Search, and Environment modeling, which can greatly simplify application development. Experiments and production use cases demonstrate the efficiency and usability of D-Ocean.
