Research on social data by means of cluster analysis

Abstract: This paper presents a data mining study and cluster analysis of social data collected from small producers and family farmers in six cities in the countryside of Ceará state, northeast Brazil. The data cover demographic, economic, agricultural, and food insecurity information. The goal of the study is to establish profiles for the small producer families that live in the region and to identify the features that differentiate these profiles. We also provide an efficient data mining methodology for social data sets that handles their typical challenges, such as mixed variable types and an abundance of null values. We use the Silhouette method to estimate the best number of natural groups in the data and the Partitioning Around Medoids clustering algorithm to compute the profiles. The Correlation-Based Feature Selection method is used to identify which social criteria matter most in differentiating the families of each profile. Classification models based on support vector machines, multilayer perceptrons, and decision trees were developed to predict which of the identified clusters an arbitrary family would best fit. We obtained a good separation of the families into two clusters, and a multilayer perceptron model with approximately 93.5% prediction accuracy.
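The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Gower's dissimilarity as one way to compare records with mixed variable types and null values (the abstract does not name the measure), uses a simplified first-improvement variant of the Partitioning Around Medoids swap phase, and runs on a small set of invented family records whose fields (household size, monthly income, irrigation, food insecurity level) are purely hypothetical.

```python
from itertools import combinations

def gower(a, b, kinds, ranges):
    """Gower dissimilarity; variables with a null (None) on either side are skipped."""
    total, used = 0.0, 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue
        if kind == "num":
            total += abs(x - y) / rng if rng else 0.0
        else:
            total += 0.0 if x == y else 1.0
        used += 1
    # Assumption: records sharing no observed variable are treated as maximally distant.
    return total / used if used else 1.0

def distance_matrix(rows, kinds):
    ranges = []
    for j, kind in enumerate(kinds):
        vals = [r[j] for r in rows if r[j] is not None]
        ranges.append(max(vals) - min(vals) if kind == "num" and vals else None)
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = gower(rows[i], rows[j], kinds, ranges)
    return d

def cost(d, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(d[i][m] for m in medoids) for i in range(len(d)))

def pam(d, k):
    n, medoids = len(d), []
    # BUILD: greedily add the point that lowers the total cost the most.
    while len(medoids) < k:
        free = [i for i in range(n) if i not in medoids]
        medoids.append(min(free, key=lambda i: cost(d, medoids + [i])))
    # SWAP (simplified): accept any medoid/non-medoid swap that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = cost(d, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(d, trial)
                if trial_cost < current - 1e-12:
                    medoids, current, improved = trial, trial_cost, True
    labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
    return medoids, labels

def mean_silhouette(d, labels):
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:          # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(d[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(d[i][j] for j in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical records: (household size, monthly income, irrigates, food insecurity)
kinds = ["num", "num", "cat", "cat"]
families = [
    (3, 400.0, "yes", "none"),   (4, 450.0, "yes", "none"),
    (3, None,  "yes", "mild"),   (4, 420.0, "yes", None),
    (7, 120.0, "no",  "severe"), (8, 100.0, "no",  "severe"),
    (7, None,  "no",  "moderate"), (9, 90.0, None, "severe"),
]

d = distance_matrix(families, kinds)
results = {k: pam(d, k) for k in (2, 3, 4)}
best_k = max(results, key=lambda k: mean_silhouette(d, results[k][1]))
best_medoids, best_labels = results[best_k]
print(f"best k = {best_k}, labels = {best_labels}")
```

Because both the medoid search and the silhouette work only on the precomputed dissimilarity matrix, the same functions apply unchanged if a different dissimilarity measure is substituted for the assumed Gower coefficient.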
