Data Mining: The Next Generation

Data Mining (DM) has enjoyed great popularity in recent years, with advances in both research and commercialization. The first generation of DM research and development has yielded several commercially available systems, both stand-alone and integrated with database systems; produced scalable versions of algorithms for many classical DM problems; and introduced novel pattern discovery problems. In recent years, research has tended to be fragmented into several distinct pockets without a comprehensive framework. Researchers have continued to work largely within the parameters of their parent disciplines, building upon existing and distinct research methodologies. Even when they address a common problem (for example, how to cluster a dataset), they apply different techniques, bring different perspectives on which issues are important, and use different evaluation criteria. While different approaches can be complementary, and such diversity is ultimately a strength of the field, better communication across disciplines is required if DM is to forge a distinct identity with a core set of principles, perspectives, and challenges that differentiate it from each of the parent disciplines. Further, while the amount and complexity of data continue to grow rapidly, and the task of distilling useful insight continues to be central, serious concerns have emerged about the social implications of DM. Addressing these concerns will require advances in our theoretical understanding of the principles that underlie DM algorithms, as well as an integrated approach to security and privacy in all phases of data management and analysis. Researchers from a variety of backgrounds assembled at Dagstuhl to re-assess the current directions of the field, to identify critical problems that require attention, and to discuss ways to increase the flow of ideas across the different disciplines that DM has brought together. The workshop did not seek to draw up an agenda for the field of DM. Rather, this report offers the participants' perspective on two technical directions, compositionality and privacy, and describes some important application challenges that drove the discussion. Both of these directions illustrate the opportunities for cross-disciplinary research, and there was broad agreement that they represent important and timely areas for further work; of course, the choice of these directions as topics for discussion also reflects the personal interests and biases of the workshop participants.
