When is the Right Time to Refresh Knowledge Discovered From Data?

Knowledge discovery in databases (KDD) techniques have been extensively employed to extract knowledge from massive data stores to support decision making in a wide range of critical applications. Maintaining the currency of discovered knowledge over evolving data sources is a fundamental challenge faced by all KDD applications. This paper addresses the challenge from the perspective of deciding the right times to refresh knowledge. We define the knowledge-refreshing problem and model it as a Markov decision process. Based on the identified properties of the Markov decision process model, we establish that the optimal knowledge-refreshing policy is monotonically increasing in the system state within every appropriate partition of the state space. We further show that the problem of searching for the optimal knowledge-refreshing policy can be reduced to the problem of finding the optimal thresholds, and we propose a method for computing the optimal knowledge-refreshing policy. The effectiveness and robustness of the computed optimal knowledge-refreshing policy are examined through extensive empirical studies addressing a real-world knowledge-refreshing problem. Our method can be applied to refresh knowledge for KDD applications that employ major data-mining models.
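
To make the idea of a threshold policy concrete, the sketch below solves a deliberately simplified toy problem by value iteration. It is not the paper's model: the state (a count of unprocessed data batches), the arrival probability, the linear staleness cost, the fixed refresh cost, the discount factor, and the function name solve_refresh_mdp are all assumptions chosen for illustration. In this toy setting the computed policy waits below some staleness level and refreshes at or above it, which is the kind of monotone structure the abstract describes.

```python
def solve_refresh_mdp(n_states=20, p_arrival=0.6, stale_cost=1.0,
                      refresh_cost=8.0, gamma=0.95, tol=1e-9):
    """Value iteration for a toy knowledge-refreshing MDP (illustrative only).

    State s in 0..n_states: number of data batches not yet reflected in the
    discovered knowledge. Each period one new batch arrives with probability
    p_arrival. Action 0 (wait) costs stale_cost * s; action 1 (refresh) costs
    refresh_cost and resets s to 0 before the next arrival.
    """
    v = [0.0] * (n_states + 1)

    def continuation(s, values):
        # Expected discounted future cost when the period ends at staleness s.
        nxt = min(s + 1, n_states)
        return gamma * (p_arrival * values[nxt] + (1 - p_arrival) * values[s])

    while True:
        q_refresh = refresh_cost + continuation(0, v)
        q_wait = [stale_cost * s + continuation(s, v)
                  for s in range(n_states + 1)]
        new_v = [min(qw, q_refresh) for qw in q_wait]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if delta < tol:
            break

    # Greedy policy with respect to the converged value function.
    q_refresh = refresh_cost + continuation(0, v)
    policy = [1 if q_refresh < stale_cost * s + continuation(s, v) else 0
              for s in range(n_states + 1)]
    # Smallest state in which refreshing is preferred (None if never).
    threshold = next((s for s, a in enumerate(policy) if a == 1), None)
    return policy, threshold


policy, threshold = solve_refresh_mdp()
print("refresh when staleness >=", threshold)
print("policy:", policy)
```

Because the policy in this toy problem is fully described by a single number, a search over thresholds could replace a search over all state-to-action mappings. The paper's actual state space, partitions, and cost structure are not given in the abstract and may differ substantially from these assumptions.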
