Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data

The k-nearest neighbors algorithm is characterized as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning the algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these weaknesses have been turned into strengths, and the k-nearest neighbors rule has become a core algorithm for identifying and correcting imperfect data, removing noisy and redundant samples, or imputing missing values, thus transforming Big Data into Smart Data, that is, data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this Smart Data gleaning algorithm in a supervised learning context is investigated, including a brief overview of Smart Data, current and future trends for the k-nearest neighbors algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging Big Data-ready versions of these algorithms and develop new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines on how to use the k-nearest neighbors algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark packages including all the analyzed Smart Data algorithms have been developed.
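
As an illustration of how the neighborhood rule doubles as a data-cleaning tool, below is a minimal single-machine sketch of Wilson-style Edited Nearest Neighbor (ENN) filtering: an instance whose label disagrees with the majority label of its k nearest neighbors is discarded as noise. The `enn_filter` helper, the choice k=3, and the synthetic dataset are illustrative assumptions for this sketch, not the distributed Spark implementations the paper provides.

```python
# Minimal ENN-style noise filter, assuming scikit-learn is available.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def enn_filter(X, y, k=3):
    """Return indices of instances kept by an ENN-style editing rule."""
    # n_neighbors=k+1 because, when querying the training set itself,
    # each point is returned as its own nearest neighbor.
    knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    neighbors = knn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    keep = [i for i, nbrs in enumerate(neighbors)
            if np.bincount(y[nbrs]).argmax() == y[i]]  # majority agrees
    return np.array(keep)

# Usage on a hypothetical noisy training set: two Gaussian features,
# labels determined by the first feature, with 10% of labels flipped.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.choice(len(y), size=20, replace=False)
y[flip] = 1 - y[flip]  # inject label noise
kept = enn_filter(X, y, k=3)
print(f"ENN kept {kept.size} of {y.size} training instances")
```

The same neighborhood machinery underlies missing values imputation. As a sketch of that second use, scikit-learn's KNNImputer replaces each missing entry with the mean of that feature over the k nearest samples; the injected gaps below are again hypothetical.

```python
# kNN-based imputation on the same illustrative data with missing entries.
from sklearn.impute import KNNImputer

X_missing = X.copy()
X_missing[rng.choice(len(X), size=15, replace=False), 0] = np.nan
X_imputed = KNNImputer(n_neighbors=3).fit_transform(X_missing)
```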
