The Needles-in-Haystack Problem

We consider a new problem: detecting members of a rare class of data, the needles, that are hidden in a set of records, the haystack. The only information available about the rare class is a single instance of a needle. Members of the needle class are assumed to be similar to each other according to an unknown needle characterization. The goal is to find the needle records hidden in the haystack. This paper describes an algorithm for that task and applies it to several example cases.
