Smart Data: Where the Big Data Meets the Semantics

Big data technology is designed to address the three Vs of big data: volume (massive amounts of data), variety (a range of data types and sources), and velocity (the speed of data in and out). Big data is often captured without a specific purpose, so most of it is irrelevant to any given task. The most important feature of data is neither its volume nor the other Vs, but its value. While big data is the technological foundation for data-driven business decision-making, smart data is an organized way to semantically compile, manipulate, correlate, and analyze different data sources.

To deal with volume, semantic technology facilitates better decision-making by converting massive amounts of data into abstractions, meanings, and insights. Neural network algorithms offer advantages for deep learning and exploit the whole of the data rather than parts of it. The article "A New Data Representation Based on Training Data Characteristics to Extract Drug Name Entity in Medical Text" by M. Sadikin et al. proposes three data representation techniques to analyze the characteristics of word distribution and word similarities as a result of word-embedding training. These techniques include multilayer perceptrons, deep-network classifiers (deep belief networks, stacked denoising encoders), and long short-term memory. In the article "Objects Classification by Learning-Based Visual Saliency Model and Convolutional Neural Network" by N. Li et al., a neuroscience-inspired classification method is proposed to simulate the human visual information processing mechanism. This method combines a visual attention model with a convolutional neural network to increase the accuracy of classifying objects, especially in biology. In the article "Fracture Mechanics Method for Word Embedding Generation of Neural Probabilistic Linguistic Model," S. Bi et al. propose a force-directed method that uses a fracture mechanics model to learn word embeddings. The method aims to improve the accuracy, recall, and text visualization of traditional language models; a word embedding, which is a semantic vector representation, can be generated via the neural linguistic model.

For variety, integrating heterogeneous data sources requires effective methods for providing well-defined ontologies and natural language processing. In the article "A Character Level Based and Word Level Based Approach for Chinese-Vietnamese Machine Translation" by P. Tran et al., a hybrid method is proposed to translate one natural language into another (e.g., from Chinese to Vietnamese) by combining the strengths of statistics-based and rule-based translation approaches at both the character and word levels. In addition …
