Journeys to Data Mining: Experiences from 15 Renowned Researchers

Data mining, an interdisciplinary field combining methods from artificial intelligence, machine learning, statistics, and database systems, has grown tremendously over the last 20 years and produced core results for applications such as business intelligence, spatio-temporal data analysis, bioinformatics, and stream data processing. The fifteen contributors to this volume are successful and well-known data mining scientists and professionals. Although the list is by no means exhaustive, all of them have helped the field gain the reputation and importance it enjoys today through their many valuable contributions. Mohamed Medhat Gaber asked them (and many others) to write down their journeys through the data mining field by answering the following questions:

1. What are your motives for conducting research in the data mining field?
2. Describe the milestones of your research in this field.
3. What are your notable success stories?
4. How did you learn from your failures?
5. Have you encountered unexpected results?
6. What are the current research issues and challenges in your area?
7. Describe your research tools and techniques.
8. How would you advise a young researcher to make an impact?
9. What do you predict for the next two years in your area?
10. What are your expectations in the long term?

To maintain the informal character of their contributions, they were given complete freedom in how to organize their answers. This narrative style gives PhD students and novices who are eager to find their way to successful research in data mining valuable insights into career planning. In addition, anyone interested in the history of computer science may be surprised by the stunning successes and possible failures that computer science careers (still) have to offer.
