Distributed feature extraction in a p2p setting - a case study

Finding the right data representation is essential for virtually every data mining application. In this work we describe an approach to collaborative feature extraction, selection and aggregation in distributed, loosely coupled domains. In contrast to other work in the field of distributed data mining, we focus on scenarios in which a large number of loosely coupled nodes apply data mining to different, usually very small and overlapping, subsets of the entire data space. The aim is not to find a single global concept that covers all data, but to learn a set of local concepts. Our prototypical application is a distributed media organization platform, called Nemoz, that assists users in maintaining their media collections. We propose two models for collaborative feature extraction, selection and aggregation for supervised data mining: one based on a centralized p2p architecture, the other on a fully distributed p2p architecture. We compare both models on a real-world data set and discuss their advantages and problems.
