Data Model for Analysis of Scholarly Documents in the MapReduce Paradigm

At CeON ICM UW we hold a large collection of scholarly documents that we store and process using the MapReduce paradigm. One of the main challenges is to design a simple but effective data model that fits various data access patterns and allows us to perform diverse analyses efficiently. In this paper, we describe the organization of our data and explain how this data is accessed and processed by open-source tools from the Apache Hadoop ecosystem.
