Leveraging One-Class SVM and Semantic Analysis to Detect Anomalous Content

Experiments tested several hypotheses about methods for improving document classification for the malicious insider threat problem within the Intelligence Community. Bag-of-words (BOW) document representations were compared with representations based on Natural Language Processing (NLP) in both the typical and the one-class classification problems, using the Support Vector Machine (SVM) algorithm. The results show that the NLP features significantly improved classifier performance over the BOW approach in both precision and recall, while using far fewer features. The one-class algorithm with NLP features also demonstrated robustness when tested on new domains.
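As a rough illustration of two terms used above, the sketch below builds a bag-of-words representation and computes precision and recall for a set of predictions. It is not the authors' pipeline: the whitespace tokenizer, the function names `bag_of_words` and `precision_recall`, and the toy labels are all assumptions made for this example, and no SVM is trained here.

```python
from collections import Counter


def bag_of_words(text):
    """Map a document to lowercase token counts.

    Assumption: plain whitespace tokenization, chosen only for brevity.
    """
    return Counter(text.lower().split())


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): 1 = on-topic document, 0 = anomalous document.
features = bag_of_words("The analyst opened the report")
precision, recall = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
```

A BOW vector grows with the vocabulary, which is one reason a representation with fewer, more informative features can be attractive when it achieves comparable or better precision and recall.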
