Big Data Preprocessing: An Application on Online Social Networks

The mass adoption of social network services has turned online social networks into a big data source. The results of machine learning and statistical analysis depend heavily on data preprocessing. The purpose of data preprocessing is to convert the data into a format suitable for analysis and to ensure high data quality. However, the management of unstructured and semi-structured data remains largely unexplored, and new preprocessing techniques are required to handle big data. This chapter investigates the data preprocessing stages for big data sources, with an emphasis on online social networks. Special attention is paid to practical questions regarding low-quality data, including incomplete, imbalanced, and noisy data. Furthermore, the challenges and potential solutions of statistical and rule-based analysis for data cleansing are reviewed, and the contribution of natural language processing, feature engineering, and machine learning methods is explored. Online social networks are examined in terms of (i) context, (ii) analysis practices, (iii) low-quality data, and, most importantly, (iv) how low-quality data is addressed by techniques and frameworks. Finally, preprocessing in the broader field of distributed infrastructures is briefly reviewed.
