Setting up and managing real-world data mining and optimization case studies: Motivations and challenges

The rapid evolution of tools and software systems for designing experiments and for automatically monitoring, collecting, and warehousing large amounts of data from applications such as the life sciences and industrial processes has resulted in a paradigm shift. The shift is so fast that some practices for optimizing and managing these processes that were valid only 5–10 years ago may no longer be fully acceptable or sufficient for today's business optimization and management. This directly influences best practices for knowledge discovery, and for managing the discovered knowledge, in real-world data mining applications. Establishing and managing a real-world data mining project in any domain, and particularly in today's life science industry, is not a trivial task. A few approaches have been proposed in the literature. However, the initiation and successful management of such efforts may depend on where a given case study fits in the overall classification of data mining approaches. Today's knowledge discovery from data can be classified in several ways: (i) data mining on engineered systems (e.g. complex equipment) or on systems designed by nature (e.g. life sciences), (ii) explanatory or predictive data mining, (iii) data mining from static data (e.g. a data warehouse) or from dynamic data (e.g. data streams), and (iv) user-operated or automated data mining. There may be other ways to classify data mining applications. This talk provides an overview of the knowledge discovery applications listed above. We provide examples that demonstrate how small or large amounts of data, when understood from a real-world data mining point of view and properly integrated, can result in novel knowledge discovery case studies.
We explain the motivations for and challenges of establishing real-world data mining case studies, and we demonstrate how our case studies can lead to real-world applications and even to tools that could be deployed for better management of today's data-rich environments.