A pipeline for mining association rules from large datasets of retailers invoices

Massive data generation now affects many domains, including marketing (for example, the electronic invoices, or e-invoices, of large retailers), web access logs, healthcare, and the life sciences. Datasets keep growing because cheap connected devices, such as mobile devices, RFID, and wireless sensor networks, make data easy to collect. The collected data often must be gathered into a consistent, integrated, and comprehensive form before it can be used for knowledge discovery. Without adequate cleaning, transformation, and structuring before analysis, it is hard to mine useful knowledge. Once the data are prepared, users can apply data mining to extract knowledge from large collections of invoice documents. This paper proposes a pipeline for preprocessing large retailers' commercial documents and mining association rules from them. The preprocessing stage performs merging, cleaning, formatting, and summarization. The methodology can improve the quality of large retailers' data by reducing the quantity of irrelevant data, making the remaining data suitable for association rule mining (ARM). Applying the proposed methodology to a real invoice dataset provided by an Italian retailer yielded 36 significant association rules that highlight customers' behavior in the purchase of goods.
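To make the stages named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes invoice lines arrive as hypothetical `(invoice_id, item)` pairs from two sources, merges them, cleans and normalizes them, summarizes them into one basket per invoice, and then derives simple one-to-one rules by support and confidence. The function names, thresholds, and sample data are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_baskets(rows):
    """Group (invoice_id, item) rows into one item set per invoice."""
    baskets = defaultdict(set)
    for invoice_id, item in rows:
        item = (item or "").strip().lower()   # formatting: normalize item names
        if invoice_id is None or not item:    # cleaning: drop incomplete rows
            continue
        baskets[invoice_id].add(item)         # summarization: one set per invoice
    return list(baskets.values())


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for rules between single items."""
    n = len(baskets)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_count[item] += 1
        for pair in combinations(sorted(basket), 2):
            pair_count[pair] += 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                   # share of invoices with both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical invoice lines from two sources (illustrative data only).
source_a = [(1, "Bread"), (1, "Milk"), (2, "bread "), (2, "milk"), (2, "eggs")]
source_b = [(3, "milk"), (3, None), (4, "bread"), (4, "milk"), (None, "eggs")]

rows = source_a + source_b                    # merging
baskets = build_baskets(rows)
rules = pair_rules(baskets)

for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Restricting rules to item pairs keeps the sketch short; mining larger itemsets would normally rely on an algorithm such as Apriori or FP-Growth, and the abstract does not say which one the paper uses.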
