Performance Improvement of Open Source Based Business Intelligence System Using Database Modeling and Outlier Detection

With today's technology, new data is generated every minute. For example, in 2000 the average computer hard disk held about 10 gigabytes, whereas today the Facebook website alone takes in about 500 terabytes of new data per day [1]. Data is growing rapidly, but raw data by itself has little value. It is therefore important to extract from large volumes of data the information that will be useful in the future. Business intelligence (BI) systems support business decisions by analyzing collected data and making predictions [2]. However, the accuracy of a prediction depends on data quality. In practice, data is usually of low quality and contains many incomplete and anomalous records. A further problem is that query response slows down as data size increases. In our previous work, we proposed a framework based on open-source technologies that enables BI systems to analyze big data efficiently, and we applied it to a supermarket's BI system. For that solution, we studied and successfully deployed the Hadoop data storage system, the Hive data warehouse software, the Sqoop data transfer tool, and related components. In this paper, we add an anomaly detection stage to the proposed framework; by eliminating anomalies, it improves the information obtained about related products that are purchased together. We also conduct an experimental study on speeding up time-dependent reports by applying the dimensional model to the Hive data warehouse. In the dimensional model, the context of the data is stored in a single table (centralized context), whereas in the relational model the context is distributed over many tables. The experimental study shows that the dimensional model is more efficient: its query response is at least twice as fast as that of the data warehouse based on the relational model.
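The contrast between the two models can be illustrated with a small sketch. It is not the system described above: SQLite stands in for Hive, and every table name, column name, and data value is hypothetical. The same monthly sales report is computed once from normalized tables, where the context has to be gathered through joins, and once from a single wide table that already carries the date and category context. The sketch only shows that both layouts yield the same report; it does not reproduce the timing results.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database holding both layouts (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Relational layout: context is spread over several tables.
        CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT,
                              category_id INTEGER);
        CREATE TABLE receipt (receipt_id INTEGER PRIMARY KEY, sold_on TEXT);
        CREATE TABLE receipt_line (receipt_id INTEGER, product_id INTEGER,
                                   amount REAL);
        -- Dimensional-style layout: one wide table carrying the context.
        CREATE TABLE sales_flat (receipt_id INTEGER, year INTEGER, month INTEGER,
                                 product TEXT, category TEXT, amount REAL);
    """)
    con.executemany("INSERT INTO category VALUES (?, ?)",
                    [(1, "dairy"), (2, "bakery")])
    con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                    [(1, "milk", 1), (2, "cheese", 1), (3, "bread", 2)])
    con.executemany("INSERT INTO receipt VALUES (?, ?)",
                    [(1, "2019-01-05"), (2, "2019-01-20"), (3, "2019-02-03")])
    con.executemany("INSERT INTO receipt_line VALUES (?, ?, ?)",
                    [(1, 1, 2.0), (1, 3, 1.5), (2, 2, 4.0),
                     (3, 1, 2.0), (3, 3, 1.5)])
    # Load the wide table once, resolving all joins at load time.
    con.execute("""
        INSERT INTO sales_flat
        SELECT r.receipt_id,
               CAST(strftime('%Y', r.sold_on) AS INTEGER),
               CAST(strftime('%m', r.sold_on) AS INTEGER),
               p.name, c.name, l.amount
        FROM receipt_line l
        JOIN receipt r ON r.receipt_id = l.receipt_id
        JOIN product p ON p.product_id = l.product_id
        JOIN category c ON c.category_id = p.category_id
    """)
    return con


def relational_report(con):
    """Monthly revenue per category; joins are resolved at query time."""
    return con.execute("""
        SELECT CAST(strftime('%Y', r.sold_on) AS INTEGER),
               CAST(strftime('%m', r.sold_on) AS INTEGER),
               c.name, SUM(l.amount)
        FROM receipt_line l
        JOIN receipt r ON r.receipt_id = l.receipt_id
        JOIN product p ON p.product_id = l.product_id
        JOIN category c ON c.category_id = p.category_id
        GROUP BY 1, 2, 3
        ORDER BY 1, 2, 3
    """).fetchall()


def dimensional_report(con):
    """Same report from the single wide table; no joins are needed."""
    return con.execute("""
        SELECT year, month, category, SUM(amount)
        FROM sales_flat
        GROUP BY year, month, category
        ORDER BY year, month, category
    """).fetchall()


if __name__ == "__main__":
    con = build_db()
    for row in dimensional_report(con):
        print(row)
```

The wide table trades storage and load-time work for simpler queries: the joins are paid once when the table is populated rather than on every report.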

[1]  Jie Liu, et al.  Clinical data preprocessing and case studies of POMDP for TCM treatment knowledge discovery, 2012, 2012 IEEE 14th International Conference on e-Health Networking, Applications and Services (Healthcom).

[2]  Keun Ho Ryu, et al.  Application of a Mobile Chronic Disease Health-Care System for Hypertension Based on Big Data Platforms, 2018, J. Sensors.

[3]  Alfredo Cuzzocrea.  Analytics over Big Data: Exploring the Convergence of DataWarehousing, OLAP and Data-Intensive Cloud Infrastructures, 2013, 2013 IEEE 37th Annual Computer Software and Applications Conference.

[4]  Keun Ho Ryu, et al.  Unsupervised Novelty Detection Using Deep Autoencoders with Density Based Clustering, 2018, Applied Sciences.

[5]  Varun Chandola, et al.  Anomaly detection: A survey, 2009, CSUR.

[6]  Xindong Wu, et al.  Data mining with big data, 2014, IEEE Transactions on Knowledge and Data Engineering.

[7]  Michael J. Shaw, et al.  Knowledge management and data mining for marketing, 2001, Decis. Support Syst.

[8]  Young Sung Cho, et al.  Effective Purchase Pattern Mining with Weight Based on FRAT Analysis for Recommender in e-Commerce, 2015.

[9]  Kwang Sun Ryu, et al.  Discovering Medical Knowledge using Association Rule Mining in Young Adults with Acute Myocardial Infarction, 2013, Journal of Medical Systems.

[10]  Felix Naumann, et al.  Data fusion, 2009, CSUR.

[11]  Christer Carlsson, et al.  Past, present, and future of decision support technology, 2002, Decis. Support Syst.

[12]  Vipin Kumar, et al.  Introduction to Data Mining, 2022, Data Mining and Machine Learning Applications.

[13]  William Yeoh, et al.  Business Intelligence Systems: State-of-the-Art Review and Contemporary Applications, 2009.