Uncertainty in big data analytics: survey, opportunities, and challenges

Big data analytics has gained wide attention from both academia and industry as the demand for understanding trends in massive datasets increases. Recent developments in sensor networks and cyber-physical systems, together with the ubiquity of the Internet of Things (IoT), have increased the collection of data (in health care, social media, smart cities, agriculture, finance, education, and more) to an enormous scale. However, data collected from sensors, social media, financial records, and similar sources is inherently uncertain due to noise, incompleteness, and inconsistency. Analyzing such massive amounts of data requires advanced analytical techniques that can efficiently review the data and/or predict future courses of action with high precision, and that can support advanced decision-making strategies. As the amount, variety, and speed of data increase, so too does the uncertainty inherent in it, leading to a lack of confidence in the resulting analytics process and in the decisions based on it. Compared with traditional data techniques and platforms, artificial intelligence techniques (including machine learning, natural language processing, and computational intelligence) provide more accurate, faster, and more scalable results in big data analytics. Previous research and surveys on big data analytics tend to focus on one or two techniques or on specific application domains. However, little work has addressed uncertainty in big data analytics, or uncertainty in the artificial intelligence techniques applied to these datasets. This article reviews previous work in big data analytics and discusses open challenges and future directions for recognizing and mitigating uncertainty in this domain.
