Big Datasets for Research: A Survey on Flagship Conferences

Big data opens new opportunities to discover valuable information, and the corresponding big datasets are powerful tools that connect theoretical studies to reality: they help scholars evaluate their results and identify new problems. In recent years, research data repositories and registries have grown significantly. However, these infrastructures are fragmented across institutions, countries, and research domains, so finding research datasets is not a trivial task for many researchers. We therefore surveyed 195 papers on big data from several notable international conferences over the past 3 years and gathered the 285 datasets mentioned in them. In this paper, we present and analyze the survey results, covering the status quo of both big data research and the datasets it relies on. In particular, we propose two taxonomies of big datasets and classify the surveyed datasets under each. We also briefly introduce 7 widely accepted online data collections. Finally, we give some basic principles for scholars choosing and using big datasets.
