Scaling of Complex Calculations over Big Data-Sets

This article introduces a novel approach to scaling complex calculations in extensive IT infrastructures and presents significant case studies from the SONCA and DISESOR projects. The described system enables parallel calculations by providing dynamic data sharding without requiring direct integration with storage repositories. The presented solution does not require one phase of processing to complete before the next one starts; hence it is suitable for supporting many dependent calculations and can be used to provide scalability and robustness for whole data processing pipelines. The introduced mechanism is designed to support data that is still emerging, so it is suitable for data streams, e.g. the transformation and analysis of data collected from multiple sensors. As will be shown in this article, this approach scales well and is attractive because it can be easily applied to data processing between heterogeneous systems.
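The two properties named above, assigning shards as data arrives and letting a dependent phase start before the previous one finishes, can be illustrated with a minimal Python sketch. It is not the system described in this article: the function names (`shard_of`, `shard_stream`, `stage_one`, `stage_two`), the CRC32-based shard policy, and the toy per-record computations are all assumptions chosen for illustration only.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, int]          # (key, value), e.g. (sensor id, reading)
Sharded = Tuple[int, str, int]    # (shard, key, value)


def shard_of(key: str, num_shards: int) -> int:
    """Hypothetical shard policy: a stable hash of the key, modulo shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shard_stream(records: Iterable[Record], num_shards: int) -> Iterator[Sharded]:
    """Assign each record to a shard as it arrives.

    The source is only iterated, so no integration with the underlying
    storage is needed and the data set may still be growing.
    """
    for key, value in records:
        yield shard_of(key, num_shards), key, value


def stage_one(sharded: Iterable[Sharded]) -> Iterator[Sharded]:
    """First calculation (toy example): double every value."""
    for shard, key, value in sharded:
        yield shard, key, value * 2


def stage_two(processed: Iterable[Sharded]) -> Iterator[Tuple[int, int]]:
    """Dependent calculation: running total per shard.

    It consumes stage_one output record by record, so it starts
    before stage_one has seen the whole input.
    """
    totals: Dict[int, int] = defaultdict(int)
    for shard, _key, value in processed:
        totals[shard] += value
        yield shard, totals[shard]


if __name__ == "__main__":
    readings = ((f"sensor-{i}", i) for i in range(5))
    for shard, total in stage_two(stage_one(shard_stream(readings, 3))):
        print(f"shard {shard}: running total {total}")
```

Because every stage is a generator, the second stage emits its first result after a single input record has been read. In a real deployment each shard would be handled by a separate worker rather than in one process.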
