Data Mining for the Internet of Things: Literature Review and Challenges

The massive data generated by the Internet of Things (IoT) are considered to be of high business value, and data mining algorithms can be applied to IoT data to extract hidden information. In this paper, we systematically review data mining from three perspectives: the knowledge view, the technique view, and the application view. The review covers classification, clustering, association analysis, time series analysis, and outlier analysis, and it also surveys the latest application cases. As more and more devices are connected to the IoT, large volumes of data must be analyzed, and the latest algorithms need to be modified to apply to big data. We review these algorithms and discuss challenges and open research issues. Finally, we suggest a big data mining system.
