Data Mining Model for Big Data Analysis

Big Data concern large-volume, complex, growing data sets with multiple autonomous sources. With the rapid development of networking, data storage, and data collection capacity, Big Data are now expanding rapidly in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and in the Big Data revolution.
