Isolating critical data points from the boundary region with feature selection

Large databases may contain critical instances, or chunks: small sets of records that carry domain-specific information. These chunks are useful for future decision making because they improve classification accuracy when labeling critical, unlabeled instances, reducing both false positives and false negatives. The classification process can be assessed in terms of efficiency and effectiveness: efficiency concerns the time needed to process the records, which can be lowered by reducing the number of attributes in the data set, while effectiveness is the improvement in classification accuracy obtained by exploiting this crucial information. This work focuses on reducing the attributes of large databases and puts forward a procedure for computing criticality that isolates critical instances from the boundary region; the approach is validated on real-world data sets. It also applies an attribute reduction technique before fetching the critical instances in order to lower the computational time. The experimental results show that only small subsets of instances are isolated as critical nuggets, and that attribute reduction decreases the computational time. The data set with reduced attributes does not degrade classification accuracy and produces the same results as the original data set. The results further show that these critical records substantially improve classification accuracy while reducing computational time, as validated on real-life data sets. A minimal sketch of this general pipeline is given below.
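
The abstract does not spell out the exact criticality measure or the attribute reduction technique used, so the following Python sketch is only a hypothetical illustration of the general pipeline, assuming mutual-information-based feature selection (scikit-learn's SelectKBest) and a simple kNN label-disagreement score as a stand-in criticality measure for flagging boundary-region instances. The dataset, the variable names, and the 0.3 cut-off are illustrative assumptions, not the authors' procedure.

```python
# Hypothetical sketch: feature selection followed by a kNN-based
# "criticality" score that flags instances near the class boundary.
# This is NOT the paper's algorithm, only an assumed stand-in.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import NearestNeighbors

X, y = load_breast_cancer(return_X_y=True)

# Attribute reduction: keep the k features most informative about the label.
selector = SelectKBest(mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

# Criticality proxy: fraction of an instance's nearest neighbours that carry
# a different class label; high values indicate boundary-region instances.
nn = NearestNeighbors(n_neighbors=11).fit(X_reduced)
_, idx = nn.kneighbors(X_reduced)
neighbour_labels = y[idx[:, 1:]]              # drop the instance itself
criticality = (neighbour_labels != y[:, None]).mean(axis=1)

threshold = 0.3                               # illustrative cut-off
critical_nuggets = np.where(criticality >= threshold)[0]
print(f"{len(critical_nuggets)} of {len(y)} instances flagged as critical")
```

Instances whose neighbourhoods mix class labels sit near the decision boundary, which is where the abstract locates the critical nuggets; feature selection is applied first so that the neighbour search runs on fewer attributes, mirroring the claimed reduction in computational time.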
