Data mining with big data

Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This article presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and in the Big Data revolution.
