Bellwether analysis: Searching for cost-effective query-defined predictors in large databases

Mining massive datasets is a challenging problem with great potential value. Motivated by this challenge, much effort has gone into developing scalable versions of machine learning algorithms. However, the cost of mining large datasets is not just computational: preparing a dataset in the “right form” for a learning algorithm is usually costly as well, because of the human labor typically required and the large number of data-preparation choices, which include selecting different subsets of the data and aggregating it at different granularities. We make the key observation that, for a number of practically motivated problems, these choices can be defined using database queries and analyzed automatically and systematically. Specifically, we propose a new class of data-mining problem, called bellwether analysis, in which the goal is to find, among a large number of query-defined candidate predictors, a few predictors (e.g., an item’s first-week sales in Peoria, IL) that accurately predict the result of a target query (e.g., the item’s first-year worldwide sales). To make a prediction for a new item, the data needed to compute such a predictor must be collected (e.g., by selling the new item in Peoria, IL for a week and recording the sales). A useful predictor is one with high prediction accuracy and low data-collection cost; we call such a cost-effective predictor a bellwether. This article introduces bellwether analysis, which integrates database query processing and predictive modeling into a single framework, and provides scalable algorithms for large datasets that cannot fit in main memory. Through a series of extensive experiments, we show that bellwethers do exist in real-world databases and that our computational techniques achieve good efficiency on large datasets.
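
As a concrete illustration of the problem setting (not the paper’s algorithm), the following minimal Python/SQLite sketch assumes a hypothetical sales(item, region, week, amount) table of previously launched items. Candidate predictors are defined by aggregation queries over (region, horizon) subsets, each candidate is scored by the error of a simple regression against the target query over historical items, and the cheapest sufficiently accurate candidate is reported as the bellwether; the cost model and the regression are placeholders.

```python
import sqlite3

# Hypothetical schema: sales(item TEXT, region TEXT, week INTEGER, amount REAL),
# holding weekly sales of previously launched items.

def as_dict(rows):
    """Turn (item, value) rows into a dict, dropping NULL aggregates."""
    return {item: value for item, value in rows if value is not None}

def mean_abs_error(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return the mean absolute
    error on the same historical items (a stand-in for proper validation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / n

conn = sqlite3.connect("sales.db")  # assumed database of historical items

# Target query: first-year worldwide sales per item.
target = as_dict(conn.execute(
    "SELECT item, SUM(amount) FROM sales WHERE week <= 52 GROUP BY item"))

# Candidate predictor queries: early sales in one region, for a few horizons.
# Each candidate is scored by its prediction error and its data-collection cost.
candidates = []
regions = [r for (r,) in conn.execute("SELECT DISTINCT region FROM sales")]
for region in regions:
    for horizon in (1, 2, 4):                     # weeks of test marketing
        predictor = as_dict(conn.execute(
            "SELECT item, SUM(amount) FROM sales "
            "WHERE region = ? AND week <= ? GROUP BY item", (region, horizon)))
        items = [i for i in target if i in predictor]
        if len(items) < 3:
            continue
        error = mean_abs_error([predictor[i] for i in items],
                               [target[i] for i in items])
        cost = horizon                            # toy cost model: weeks of data collection
        candidates.append((error, cost, region, horizon))

# A bellwether is an accurate predictor that is cheap to collect; here we simply
# report the lowest-error candidate within a data-collection budget.
budget = 2
feasible = [c for c in candidates if c[1] <= budget]
if feasible:
    error, cost, region, horizon = min(feasible)
    print(f"bellwether: first {horizon} week(s) of sales in {region} "
          f"(MAE {error:.1f}, cost {cost} week(s))")
```

The sketch only fixes the vocabulary of the problem: a space of query-defined candidate predictors, a target query, an accuracy measure, and a data-collection cost. In the setting the article addresses, the candidate space is far larger and the data may not fit in main memory, which is what motivates the scalable search techniques the article develops.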
