TKDD Special Issue SIGKDD 2009

This Special Issue includes four articles selected from the papers accepted at the ACM SIGKDD 2009 conference. The annual ACM SIGKDD conference is a leading international forum for data mining researchers and practitioners from academia, industry, and government to share their research results, explore new ideas, and exchange experiences. This selection of articles covers significant research advances both on fundamental data mining problems, such as classification, statistical learning, anomaly detection, and privacy-preserving analysis, and in important emerging application areas, such as healthcare record management, market targeting, social networks, and Web analysis.
The first article, by Ye Chen, Dmitry Pavlov, and John F. Canny, is entitled “Behavioral Targeting: The Art of Scaling Up Simple Algorithms.” This article introduces a MapReduce statistical learning algorithm for behavioral targeting analysis that leverages historical data to select the most relevant ads for users. The proposed method achieves optimal parallelism and linear time complexity, supported by a highly efficient in-memory caching scheme and a data structure for sparse representation.
The second article, entitled “Centralized and Distributed Anonymization for High-dimensional Healthcare Data,” is authored by Noman Mohammed, Benjamin C. M. Fung, Patrick C. K. Hung, and Cheuk-Kwong Lee. This article addresses the privacy concerns that arise from sharing patients’ healthcare records. The authors propose a scalable privacy model and anonymization algorithm that preserve the information essential for analysis.
In the third article, entitled “Bayesian Browsing Model: Exact Inference of Document Relevance from Petabyte-Scale Data,” Chao Liu, Fan Guo, and Christos Faloutsos tackle the challenge of estimating the relevance of documents from petabyte-scale logs. The proposed Bayesian browsing model performs exact inference in a single pass over the data and is fully parallelizable.
In the fourth article, entitled “A Model-Agnostic Framework for Fast Spatial Anomaly Detection,” Mingxi Wu, Chris Jermaine, Sanjay Ranka, Xiuyao Song, and John Gums propose a generic framework for conducting likelihood ratio tests (LRTs) over rectangular regions that exhibit anomalous behavior in a large sparse matrix and for ranking these regions by their LRT statistics. The authors develop efficient pruning techniques that deliver significant speedups.