SmartFetch: Efficient Support for Selective Queries

This paper proposes SmartFetch, a storage strategy that combines several techniques to efficiently support selective jobs, that is, jobs concerned with only a subset of the entire dataset, in systems such as Hadoop and Spark. We combine an appropriate data layout with data indexing tools to improve data access speed and significantly shorten total job execution time. An extensive experimental evaluation shows that, by avoiding reads of irrelevant blocks, SmartFetch provides significant speedups over the basic Hadoop and Spark implementations. Our system also outperforms other implementations that use variants of the techniques embedded in SmartFetch.
