Big Data: A Survey

In this paper, we review the background and state of the art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, the Internet of Things, data centers, and Hadoop. We then focus on the four phases of the big data value chain: data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we examine several representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, and the smart grid. These discussions aim to give readers a comprehensive overview and big-picture view of this exciting area. The survey concludes with a discussion of open problems and future directions.
