Massive Data Analysis: Tasks, Tools, Applications, and Challenges

In this study, we survey state-of-the-art programming, computing, and storage technologies in the massive data analytics landscape. We shed light on the different types of analytics that can be performed on massive data. To that end, we first provide a detailed taxonomy of analytic types, with examples of each. Next, we highlight technology trends in massive data analytics that are available to corporations, government agencies, and researchers. In addition, we enumerate several opportunities for turning massive data into knowledge. We describe and position two distinct case studies of massive data analytics under investigation in our research group: recommendation systems in e-commerce applications, and link discovery to predict unknown associations between medical concepts. Finally, we discuss the lessons we have learnt and the open challenges faced by researchers and businesses in the field of massive data analytics.
