Ubiquitous Knowledge Discovery is a new research area at the intersection of machine learning and data mining with mobile and distributed systems. This paper defines the main characteristics of the objects of study and introduces a high-level framework for analyzing ubiquitous knowledge discovery systems. A number of examples from a broad range of application areas are then reviewed and analyzed in terms of this framework. Based on this material, important characteristics of the field are identified and a number of research challenges are discussed.

Ubiquitous Knowledge Discovery

Knowledge Discovery in ubiquitous environments (KDUbiq) is an emerging area of research at the intersection of two major challenges: highly distributed and mobile systems, and advanced knowledge discovery systems. Today, in many subfields of computer science and engineering, being intelligent and adaptive marks the difference between a system that works in a complex and changing environment and one that does not. Hence, projects across many areas, ranging from Web 2.0 to ubiquitous computing and robotics, aim to create systems that are “smart”, “intelligent”, or “adaptive”, making it possible to solve problems that could not be solved before.

A central assumption of KDUbiq is that what seems to be a bewildering array of different methodologies and approaches for building smart, adaptive, intelligent systems can be cast into a coherent, integrated set of key ideas centered on the notion of learning from experience. Focusing on these key ideas, KDUbiq provides a unifying framework for systematically investigating the mutual dependencies of otherwise quite unrelated technologies employed in building next-generation intelligent systems: machine learning, data mining, sensor networks, grids, P2P, data stream mining, activity recognition, Web 2.0, privacy, user modeling, and others.
Machine learning and data mining emerge as basic methodologies and indispensable building blocks for some of the most difficult computer science and engineering challenges of the next decade.

From a high-level perspective, the key characteristics of a ubiquitous knowledge discovery application are:

C1. Time and space. The objects of analysis exist in time and space. Often they are able to move.
C2. Dynamic environment. These objects might not be stable over the lifetime of an application. Instead they might appear or disappear. They exist in a dynamic and unstable environment, evolving incrementally over time.
C3. Information processing capability. The objects are endowed with information processing capabilities.
C4. Locality. The objects never see the global picture, knowing only their local spatio-temporal environment.
C5. Real-time. Because the objects typically have to take decisions or even act upon their environment, analysis and inference have to be done in real time, and not only on historic data; the models have to evolve incrementally in correspondence with the evolving environment.
C6. Distributed. In many cases an object will be able to exchange information with other objects, thus forming a truly distributed environment.

Objects to which these characteristics apply are humans, animals, and, increasingly, various kinds of computing devices. It is the latter that form the objects of study for KDUbiq.

For analyzing the different possible architectures of ubiquitous knowledge discovery systems within a high-level framework, we introduce six dimensions of KDUbiq:

1. Application Area
2. Ubiquitous Technologies
3. Resource Aware Algorithms
4. Ubiquitous Data Collection
5. Privacy and Security
6. HCI and User-Modeling

When designing a ubiquitous knowledge discovery system, major design decisions have to be taken in each of these six dimensions. These choices constrain each other, and the dependencies among them have to be carefully analyzed.
KDUbiq thus adopts a systems view on how to build next-generation knowledge discovery systems.

Two important aspects of ubiquity have to be distinguished, namely

• the ubiquity of data, and
• the ubiquity of computing.

In a prototypical application the ubiquity of the environment corresponds naturally to the ubiquity of the data: for example, the spatio-temporally tagged data in case study 3 arise because the vehicles are moving, and in case study 5 they arise because the collections are owned by different people. But there are borderline cases that are ubiquitous in one way but not in the other, e.g. clusters or grids that speed up data analysis by distributing files and computations to various computers, or track mining from GPS data where the data are analyzed on a central server in an offline batch setting.

To stimulate research and further define the field, the KDUbiq research network (www.kdubiq.org), funded by the European Commission, was launched in 2006. Currently it has more than 40 members. It is organized around working groups for each of the dimensions of KDUbiq. It has launched workshop series at KDD, ICDM, PKDD, and ECML/PKDD, on topics including mining data streams from sensor data, privacy-preserving data mining, spatial data mining, user modeling, and ubiquitous web mining. The general points that emerge from these activities are discussed in a joint book currently under preparation, the KDUbiq “Blueprint on Ubiquitous Knowledge Discovery”. It aims to give a comprehensive overview of the six design dimensions and of the research agenda needed for implementing the KDUbiq vision.

To provide a more specific description of the content of KDUbiq, in this document we analyze a number of case studies (in this extended abstract the descriptions have to be shortened).
The following selection criteria have been used: (1) each case study focuses on a different domain; (2) it presents a challenging real-life problem; (3) there is a body of prior technical work addressing at least some of the six dimensions of ubiquitous knowledge discovery. Existing work is not necessarily done under the label of “Ubiquitous Knowledge Discovery”. The subject is new and draws on work scattered across many communities. For a review of earlier work on distributed data mining see [7].

Case Study 1: Autonomous driving vehicle

The first case study provides an impressive example of how machine learning can help to solve an important real-world task: the DARPA Grand Challenge. The goal was to develop an autonomous robot capable of traversing unrehearsed road terrain. The robot had to navigate a 228 km long course through the Mojave desert in no more than 10 hours. The challenge was won in 2005 by the robot Stanley, built by a Stanford-based team led by Sebastian Thrun [17].

Modern vehicles fit the basic characteristics of ubiquitous knowledge discovery systems very well: they exist in a dynamic environment, move in time and space, are equipped with sensors, and increasingly communicate with other devices, e.g. satellites, for navigation. What sets Stanley apart from traditional cars on the hardware side is the large number of additional sensors, computational power, and actuators.

Stanley uses machine learning for a number of learning tasks, both offline and online. A classification task solved offline with machine learning is obstacle detection, where a first-order Markov model is used. A second task, solved online, is road finding: classifying images into drivable and non-drivable areas. An adaptive Mixture of Gaussians algorithm is used to model a distribution that changes over time, because it would be impossible to train the system offline for all possible situations.
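The idea of an adaptive mixture that follows a changing distribution can be illustrated with a minimal sketch. It is not Stanley's actual implementation: the one-dimensional feature, the class name `AdaptiveGaussianMixture`, the method `is_drivable`, and all parameter values (`decay`, `match_sigmas`, `init_var`, `min_var`) are assumptions made for illustration. Each new sample either refines the component that explains it or, if none does, starts a new component, replacing the one with the least remaining support once the mixture is full.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    mean: float
    var: float
    weight: float  # decayed count of the samples this component explained


class AdaptiveGaussianMixture:
    """Online 1-D mixture of Gaussians with exponential forgetting (sketch)."""

    def __init__(self, max_components=3, decay=0.95, match_sigmas=2.5,
                 init_var=1.0, min_var=1e-3):
        self.max_components = max_components
        self.decay = decay                # forgetting factor per update
        self.match_sigmas = match_sigmas  # match radius in standard deviations
        self.init_var = init_var          # variance given to a new component
        self.min_var = min_var            # floor that keeps variances positive
        self.components = []

    def _closest_match(self, x):
        """Return the nearest component within the match radius, or None."""
        best, best_dist = None, self.match_sigmas
        for comp in self.components:
            dist = abs(x - comp.mean) / math.sqrt(comp.var)
            if dist <= best_dist:
                best, best_dist = comp, dist
        return best

    def update(self, x):
        """Incorporate one sample that is known to come from the target class."""
        # Forget old evidence so the model can follow a drifting distribution.
        for comp in self.components:
            comp.weight *= self.decay

        match = self._closest_match(x)
        if match is not None:
            # Weighted incremental update of mean and variance.
            match.weight += 1.0
            rate = 1.0 / match.weight
            delta = x - match.mean
            match.mean += rate * delta
            match.var = max(self.min_var,
                            (1.0 - rate) * (match.var + rate * delta * delta))
        elif len(self.components) < self.max_components:
            self.components.append(Component(x, self.init_var, 1.0))
        else:
            # Mixture is full: recycle the component with the least support.
            weakest = min(self.components, key=lambda comp: comp.weight)
            weakest.mean, weakest.var, weakest.weight = x, self.init_var, 1.0

    def is_drivable(self, x):
        """Classify a sample as target class if some component explains it."""
        return self._closest_match(x) is not None


if __name__ == "__main__":
    model = AdaptiveGaussianMixture(max_components=2, decay=0.9)
    for value in [10.0, 10.5, 9.5, 10.2, 9.8] * 4:  # hypothetical road features
        model.update(value)
    print(model.is_drivable(10.1), model.is_drivable(50.0))
```

Because the weights decay on every update, a component that stops receiving samples gradually loses support and is the first to be replaced when a new mode appears. This is what lets such a model adapt online rather than relying on offline training alone.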
Case Study 2: Activity recognition – inferring transportation routines from GPS data

The widespread use of GPS devices has led to an explosion of interest in this type of data. One emerging area is assistive technologies: a personal guidance system that helps cognitively impaired persons find their way through a complex transportation system. This application has been proposed by the project Opportunity Knocks [14][11]. The basic infrastructure is a mobile device equipped with GPS and connected to a server. An inference module running on the server is able to learn a person’s transportation routines from the GPS data collected. It can advise the user which route to take or where to get off a bus, and it can warn the user about errors, e.g. taking the wrong bus line.

Machine learning algorithms are used to infer likely routes, activities, transportation destinations, and deviations from a normal route. This is an unsupervised learning task. The basic knowledge representation mechanism is a Dynamic Bayesian Network. In further work, Conditional Random Fields are used.

Case Study 3: Intelligent Multi-Agent Systems – Smart Home

MavHome [3] is a project that aims to build an intelligent environment, a smart home, which is able to acquire and apply knowledge about its inhabitants and surroundings. A home is seen as a rational agent capable of perceiving the state of the home through sensors and acting upon the environment through effectors. MavHome uses a sensor network with sensors for light, humidity, smoke, gas (CO), motion, and door, window, and seat status. Inhabitant localization is done using passive infrared sensors. The software architecture is based on CORBA for communication between agents. The system combines multiple heterogeneous machine learning algorithms in order to identify repeatable behaviors (patterns), to predict inhabitant activity, and to learn a control strategy.
The information is used for automation and optimization of the conditions in the house. For detecting patterns, the sequential pattern mining algorithm ED is used; it minimizes description length and processes data as they arrive, thus assuming a data stream setting. Behavior prediction is done via the ALZ algorithm, which takes ideas from the well-known LZ78 text compression algorithm. The predictive performance on real-world data collected over a month was 44% when ALZ was combined wit
[1] Andreas Krause, et al. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. ICML '07, 2007.
[2] Philip M. Long, et al. Tracking Drifting Concepts By Minimizing Disagreements. Machine Learning, 2004.
[3] Dino Pedreschi, et al. Spatio-temporal Data Mining. Encyclopedia of GIS, 2008.
[4] Krzysztof Z. Gajos, et al. Opportunity Knocks: A System to Provide Cognitive Assistance with Transportation Services. UbiComp, 2004.
[5] Yücel Saygin, et al. Privacy in Spatiotemporal Data Mining. Mobility, Data Mining and Privacy, 2008.
[6] Katharina Morik, et al. Localized Alternative Cluster Ensembles for Collaborative Structuring. ECML, 2006.
[7] Yelena Yesha, et al. Data Mining: Next Generation Challenges and Future Directions. 2004.
[8] Henry A. Kautz, et al. Learning and inferring transportation routines. Artif. Intell., 2004.
[9] Ran Wolff, et al. Privacy-preserving association rule mining in large-scale distributed systems. IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004.
[10] David Gamarnik, et al. Extension of the PAC framework to finite and countable Markov chains. COLT '99, 1999.
[11] Diane J. Cook, et al. A Multi-agent Approach to Controlling a Smart Environment. Designing Smart Homes, 2006.
[12] Noel A Cressie, et al. Statistics for Spatial Data. 1992.
[13] Katharina Morik, et al. Distributed feature extraction in a p2p setting - a case study. Future Gener. Comput. Syst., 2007.
[14] Vladimir Vapnik, et al. Statistical learning theory. 1998.
[15] Geoff Hulten, et al. Mining time-changing data streams. KDD '01, 2001.
[16] Thorsten Joachims, et al. Detecting Concept Drift with Support Vector Machines. ICML, 2000.
[17] Katharina Morik, et al. Aspect-Based Tagging for Collaborative Media Organization. WebMine, 2006.
[18] Philip S. Yu, et al. A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions. SDM, 2007.
[19] Franco Turini, et al. Knowledge Discovery from Geographical Data. Mobility, Data Mining and Privacy, 2008.
[20] Assaf Schuster, et al. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst., 2007.
[21] Sebastian Thrun, et al. A Personal Account of the Development of Stanley, the Robot That Won the DARPA Grand Challenge. AI Mag., 2006.
[22] Katharina Morik, et al. Automatic Feature Extraction for Classifying Audio Data. Machine Learning, 2005.
[23] Henry A. Kautz, et al. Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields. Int. J. Robotics Res., 2007.
[24] B. John Oommen, et al. Stochastic learning-based weak estimation of multinomial random variables and its applications to pattern recognition in non-stationary environments. Pattern Recognit., 2006.
[25] Dino Pedreschi, et al. Spatiotemporal Data Mining. Mobility, Data Mining and Privacy, 2008.
[26] João Gama, et al. Change Detection with Kalman Filter and CUSUM. Discovery Science, 2006.
[27] Jennifer Neville, et al. Relational Dependency Networks. J. Mach. Learn. Res., 2007.
[28] Kun Liu, et al. VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring. SDM, 2004.
[29] Chengqi Zhang, et al. Association Rule Mining. Lecture Notes in Computer Science, 2002.
[30] Richard A. Davis, et al. Structural Break Estimation for Nonstationary Time Series Models. 2006.
[31] Leonidas J. Guibas, et al. Wireless sensor networks - an information processing approach. The Morgan Kaufmann series in networking, 2004.
[32] Gerhard Widmer, et al. Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 1996.
[33] Richard A. Davis, et al. Introduction to time series and forecasting. 1998.
[34] Ran Wolff, et al. Association rule mining in peer-to-peer systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2003.
[35] Michèle Basseville, et al. Detection of abrupt changes: theory and application. 1993.