A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data

Recent approaches to classifying evolving data streams are based on supervised learning algorithms, which can be trained only with labeled data. Manual labeling of data is both costly and time-consuming. In a real streaming environment, where huge volumes of data arrive at high speed, labeled data may therefore be very scarce, so only a limited amount of training data may be available for building the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by building a classification model from a training set containing both unlabeled instances and a small number of labeled instances. The model is built as a set of micro-clusters using a semi-supervised clustering technique, and classification is performed with the κ-nearest neighbor algorithm. An ensemble of these models is used to classify the unlabeled data. Empirical evaluation on both synthetic data and real botnet traffic reveals that our approach, using only a small amount of labeled data for training, outperforms state-of-the-art stream classification algorithms that use twenty times more labeled data.
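
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain k-means stands in for the semi-supervised clustering technique, unlabeled instances are marked with the label -1, and the names `build_model`, `predict_one`, and `ensemble_predict`, as well as all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np
from collections import Counter


def build_model(X, y, n_clusters, n_iter=20, seed=0):
    """Summarize one data chunk as labeled micro-clusters.

    X: (n, d) feature array; y: (n,) integer labels, -1 meaning unlabeled.
    ASSUMPTION: plain k-means replaces the semi-supervised clustering
    named in the abstract; only the overall structure is illustrated.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = X[assign == c]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Final assignment against the converged centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)

    # Keep only micro-clusters containing at least one labeled instance and
    # give each the majority label of its labeled members.
    kept_centroids, kept_labels = [], []
    for c in range(n_clusters):
        labels = y[(assign == c) & (y != -1)]
        if len(labels):
            kept_centroids.append(centroids[c])
            kept_labels.append(Counter(labels.tolist()).most_common(1)[0][0])
    return np.array(kept_centroids), np.array(kept_labels)


def predict_one(model, x, kappa=1):
    """Classify x by majority vote among its kappa nearest micro-clusters."""
    centroids, labels = model
    dists = np.linalg.norm(centroids - x, axis=1)
    nearest = np.argsort(dists)[:kappa]
    return Counter(labels[nearest].tolist()).most_common(1)[0][0]


def ensemble_predict(models, x, kappa=1):
    """Combine the per-model predictions by simple majority vote."""
    votes = [predict_one(m, x, kappa) for m in models]
    return Counter(votes).most_common(1)[0][0]
```

In a streaming setting, one such model would be built per incoming chunk and kept in a bounded ensemble; how models are replaced as the stream evolves is not specified in the abstract and is left out of this sketch.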
