Big Data: Related Technologies, Challenges and Future Prospects

This Springer Brief provides a comprehensive overview of the background of big data and recent developments in the field. The value chain of big data is divided into four phases: data generation, data acquisition, data storage and data analysis. For each phase, the book introduces the general background, discusses technical challenges and reviews the latest advances. Technologies under discussion include cloud computing, the Internet of Things, data centers, Hadoop and more. The authors also explore several representative applications of big data, such as enterprise management, online social networks, healthcare and medical applications, collective intelligence and smart grids. The book concludes with a thoughtful discussion of possible research directions and development trends in the field. Big Data: Related Technologies, Challenges and Future Prospects is a concise yet thorough examination of this exciting area. It is designed for researchers and professionals interested in big data or related research. Advanced-level students in computer science and electrical engineering will also find this book useful.
