Pragamana: Performance Comparison and Programming Alpha-miner Algorithm in Relational Database Query Language and NoSQL Column-Oriented Using Apache Phoenix

Process-Aware Information Systems (PAIS) are IT systems that support business processes and generate large volumes of event logs from the execution of those processes. Each event in a log is represented as a tuple of CaseID, Timestamp, Activity and Actor. Process Mining is an emerging field that analyzes event logs to discover, enhance and improve business processes and to check conformance between run-time and design-time business processes. The large volumes of event logs generated are stored in databases. Relational databases perform well for a certain class of applications. However, there is a class of applications for which relational databases do not scale, and NoSQL database systems emerged to handle it. Discovering a process model (workflow model) from event logs is one of the most challenging and important Process Mining tasks. The α-miner algorithm is one of the first and most widely used Process Discovery techniques. Our objective is to investigate which type of database (relational or NoSQL) performs better for a Process Discovery application under Process Mining. We implement the α-miner algorithm on a relational (row-oriented) database and on a NoSQL (column-oriented) database, using each database's query language so that the algorithm is tightly coupled to the database. We present a performance benchmark and comparison of the α-miner algorithm on the row-oriented database and the NoSQL column-oriented database to determine which one can efficiently store massive event logs and analyze them in seconds to discover a process model.
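
To make the idea of coupling the α-miner to a query language more concrete, the following minimal sketch computes the first ordering relations that the α-miner builds on (directly-follows, causal and parallel) with a single SQL query. It is an illustration only, not the implementation benchmarked in this work: it uses Python's built-in sqlite3 module rather than the databases compared here, and the table name `event_log`, its column names, the helper `ordering_relations` and the toy log are all hypothetical. It also assumes that timestamps are unique within a case.

```python
import sqlite3

# Hypothetical toy event log: (case_id, ts, activity, actor).
# Three cases with traces <a,b,c,d>, <a,c,b,d> and <a,e,d>.
EVENTS = [
    (1, 1, "a", "u1"), (1, 2, "b", "u2"), (1, 3, "c", "u1"), (1, 4, "d", "u3"),
    (2, 1, "a", "u1"), (2, 2, "c", "u2"), (2, 3, "b", "u1"), (2, 4, "d", "u3"),
    (3, 1, "a", "u1"), (3, 2, "e", "u2"), (3, 3, "d", "u3"),
]

# Directly-follows: pair every event with the next event of the same case,
# i.e. the one with the smallest later timestamp (assumed unique per case).
DIRECTLY_FOLLOWS_SQL = """
SELECT DISTINCT cur.activity, nxt.activity
FROM event_log AS cur
JOIN event_log AS nxt
  ON nxt.case_id = cur.case_id
 AND nxt.ts = (SELECT MIN(later.ts)
               FROM event_log AS later
               WHERE later.case_id = cur.case_id
                 AND later.ts > cur.ts)
"""


def ordering_relations(events):
    """Return (directly-follows, causal, parallel) sets of activity pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE event_log "
        "(case_id INTEGER, ts INTEGER, activity TEXT, actor TEXT)"
    )
    conn.executemany("INSERT INTO event_log VALUES (?, ?, ?, ?)", events)
    follows = set(conn.execute(DIRECTLY_FOLLOWS_SQL).fetchall())
    conn.close()

    # Causal: x is directly followed by y, but never the other way round.
    causal = {(x, y) for (x, y) in follows if (y, x) not in follows}
    # Parallel: x and y directly follow each other in some traces.
    parallel = {(x, y) for (x, y) in follows if (y, x) in follows}
    return follows, causal, parallel


follows, causal, parallel = ordering_relations(EVENTS)
print(sorted(causal))
print(sorted(parallel))
```

The remaining α-miner steps (finding maximal pairs of activity sets and deriving places and arcs) operate on these relations and are omitted here.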
