Feature Selection Over Distributed Data Streams

Monitoring data streams in a distributed system has attracted considerable interest in recent years. The task of feature selection (e.g., by monitoring the information gain of various features) requires a very high communication overhead when addressed using straightforward centralized algorithms. While most of the existing algorithms deal with monitoring simple aggregated values such as frequency of occurrence of stream items, motivated by recent contributions based on geometric ideas we present an alternative approach. The proposed approach enables monitoring values of an arbitrary threshold function over distributed data streams through a set of constraints applied separately on each stream. We report numerical experiments on a real–world data that detect instances where communication between nodes is required, and compare the approach and the results to those recently reported in the literature.

[1]  Beatrice Gralton,et al.  Washington DC - USA , 2008 .

[2]  Samuel Madden,et al.  Fjording the stream: an architecture for queries over streaming sensor data , 2002, Proceedings 18th International Conference on Data Engineering.

[3]  R. Rockafellar Convex Analysis: (pms-28) , 1970 .

[4]  Assaf Schuster,et al.  A Geometric Approach to Monitoring Threshold Functions over Distributed Data Streams , 2010, Ubiquitous Knowledge Discovery.

[5]  丸山 徹 Convex Analysisの二,三の進展について , 1977 .

[6]  Danny Raz,et al.  Efficient reactive monitoring , 2001, Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213).

[7]  Christopher Olston,et al.  Finding (recently) frequent items in distributed data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[8]  Christos Faloutsos,et al.  Online data mining for co-evolving time sequences , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[9]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[10]  João Gama,et al.  Ubiquitous Knowledge Discovery , 2011, IDA 2011.