Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs

Process-oriented data mining (process mining) uses algorithms and data, in the form of event logs, to construct models that aim to provide insights into organisational processes. The quality of the data (both form and content) presented to the modelling algorithms is critical to the success of the process mining exercise. Cleaning event logs to address quality issues prior to conducting a process mining analysis is a necessary but generally tedious and ad hoc task. In this paper, we describe a set of data quality issues, distilled from our experiences in conducting process mining analyses, that are commonly found in process mining event logs or encountered while preparing event logs from raw data sources. We show that patterns are used in a variety of domains as a means of describing commonly encountered problems and solutions. The main contributions of this article are threefold: showing that a pattern-based approach is applicable to documenting commonly encountered event log quality issues; formulating a set of components for describing event log quality issues as patterns; and describing a collection of 11 event log imperfection patterns distilled from our experiences in preparing event logs. We postulate that a systematic approach to using such a pattern repository to identify and repair event log quality issues benefits both the process of preparing an event log and the quality of the resulting event log. The relevance of the pattern-based approach is illustrated through the application of the patterns in a case study and through an evaluation by researchers and practitioners in the field.
